Abstract

In the problem of multivariate regression, a $K$-dimensional response vector is regressed upon a common set of $p$ covariates, with a matrix $B^* \in \mathbb{R}^{p \times K}$ of regression coefficients. We study the behavior of the group Lasso using $\ell_1/\ell_2$ regularization for the union support problem, meaning that the set of $s$ rows for which $B^*$ is non-zero is recovered exactly. Studying this problem under high-dimensional scaling, we show that group Lasso recovers the exact row pattern with high probability over the random design and noise for scalings of $(n, p, s)$ such that the sample complexity parameter given by $\theta(n, p, s) := n/[2\psi(B^*) \log(p - s)]$ exceeds a critical threshold. Here $n$ is the sample size, $p$ is the ambient dimension of the regression model, $s$ is the number of non-zero rows, and $\psi(B^*)$ is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the $K$ regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that, if the design is uncorrelated on the active rows, block $\ell_1/\ell_2$ regularization for multivariate regression never harms performance relative to an ordinary Lasso approach, and can yield substantial improvements in sample complexity (up to a factor of $K$) when the regression vectors are suitably orthogonal. For more general designs, it is possible for the ordinary Lasso to outperform the group Lasso. We complement our analysis with simulations that demonstrate the sharpness of our theoretical results, even for relatively small problems.



1 Introduction 

The development of efficient algorithms for large-scale model selection has been a major goal of statistical learning research in the last decade. There is now a substantial body of work based on $\ell_1$-regularization, dating back to the seminal work of Tibshirani (1996) and Donoho and collaborators (Chen et al., 1998; Donoho and Huo, 2001). The bulk of this work has focused on the standard problem of linear regression, in which one makes observations of the form

$$y = X\beta^* + w, \tag{1}$$



where $y \in \mathbb{R}^n$ is a real-valued vector of observations, $w \in \mathbb{R}^n$ is an additive zero-mean noise vector, and $X \in \mathbb{R}^{n \times p}$ is the design matrix. A subset of the components of the unknown parameter vector $\beta^* \in \mathbb{R}^p$ are assumed non-zero; the model selection goal is to identify these coefficients and (possibly) estimate their values. This goal can be formulated in terms of the solution of a penalized optimization problem:

$$\arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \|y - X\beta\|_2^2 + \lambda_n \|\beta\|_0 \right\}, \tag{2}$$



where $\|\beta\|_0$ counts the number of non-zero components in $\beta$ and where $\lambda_n > 0$ is a regularization parameter. Unfortunately, this optimization problem is computationally intractable, a fact which has led various authors to consider the convex relaxation (Tibshirani, 1996; Chen et al., 1998)

$$\arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \|y - X\beta\|_2^2 + \lambda_n \|\beta\|_1 \right\}, \tag{3}$$



in which $\|\beta\|_0$ is replaced with the $\ell_1$ norm $\|\beta\|_1$. This relaxation, often referred to as the Lasso (Tibshirani, 1996), is a quadratic program, and can be solved efficiently by various methods (e.g., Boyd and Vandenberghe, 2004; Osborne et al., 2000; Efron et al., 2004).

A variety of theoretical results are now in place for the Lasso, both in the traditional setting where the sample size $n$ tends to infinity with the problem size $p$ fixed (Knight and Fu, 2000), as well as under high-dimensional scaling, in which $p$ and $n$ tend to infinity simultaneously, thereby allowing $p$ to be comparable to or even larger than $n$ (e.g., Meinshausen and Bühlmann, 2006; Wainwright, 2006; Zhao and Yu, 2006). In many applications, it is natural to impose sparsity constraints on the regression vector $\beta^*$, and a variety of such constraints have been considered. For example, one can consider a "hard sparsity" model in which $\beta^*$ is assumed to contain at most $s$ non-zero entries or a "soft sparsity" model in which $\beta^*$ is assumed to belong to an $\ell_q$ ball with $q < 1$. Analyses also differ in terms of the loss functions that are considered. For the model or variable selection problem, it is natural to consider the $\{0-1\}$-loss associated with the problem of recovering the unknown support set of $\beta^*$. Alternatively, one can view the Lasso as a shrinkage estimator to be compared to traditional least squares or ridge regression; in this case, it is natural to study the $\ell_2$-loss $\|\hat\beta - \beta^*\|_2$ between the estimate $\hat\beta$ and the ground truth. In other settings, the prediction error $\mathbb{E}[(Y - X^T\hat\beta)^2]$ may be of primary interest, and one tries to show risk consistency (namely, that the estimated model predicts as well as the best sparse model, whether or not the true model is sparse).



1.1 Block-structured regularization 

While the assumption of sparsity at the level of individual coefficients is one way to give meaning to high-dimensional ($p \gg n$) regression, there are other structural assumptions that are natural in regression, and which may provide additional leverage. For instance, in a hierarchical regression model, groups of regression coefficients may be required to be zero or non-zero in a blockwise manner; for example, one might wish to include a particular covariate and all powers of that covariate as a group (Yuan and Lin, 2006; Zhao et al., 2007). Another example arises when we consider variable selection in the setting of multivariate regression: multiple regressions can be related by a (partially) shared sparsity pattern, such as when there is an underlying set of covariates that are "relevant" across regressions (Obozinski et al., 2007; Argyriou et al., 2006; Turlach et al., 2005). Based on such motivations, a recent line of research (Bach et al., 2004; Zhang et al., 2008; Tropp, 2006; Yuan and Lin, 2006; Zhao et al., 2007; Obozinski et al., 2007; Ravikumar et al., 2008) has studied the use of block-regularization schemes, in which the $\ell_1$ norm is composed with some other $\ell_q$ norm ($q > 1$), thereby obtaining the $\ell_1/\ell_q$ norm defined as a sum of $\ell_q$ norms over groups of regression coefficients. The best known examples of such block norms are the $\ell_1/\ell_\infty$ norm (Turlach et al., 2005; Zhang et al., 2008), and the $\ell_1/\ell_2$ norm (Obozinski et al., 2007).



In this paper, we investigate the use of $\ell_1/\ell_2$ block-regularization in the context of high-dimensional multivariate linear regression, in which a collection of $K$ scalar outputs are regressed on the same design matrix $X \in \mathbb{R}^{n \times p}$. Representing the regression coefficients as a $p \times K$ matrix $B^*$, the multivariate regression model takes the form

$$Y = XB^* + W, \tag{4}$$

where $Y \in \mathbb{R}^{n \times K}$ and $W \in \mathbb{R}^{n \times K}$ are matrices of observations and zero-mean noise respectively. In addition, we assume a hard-sparsity model for the regression coefficients in which column $k$ of the coefficient matrix $B^*$ has non-zero entries on a subset

$$S_k := \{ i \in \{1, \ldots, p\} \mid \beta^*_{ik} \neq 0 \} \tag{5}$$

of size $s_k := |S_k|$. We focus on the problem of recovering the union of the supports, namely the set $S := \bigcup_{k=1}^K S_k$, corresponding to the subset of indices $i \in \{1, \ldots, p\}$ that are involved in at least one regression. This union support problem can be understood as the generalization of variable selection to the group setting. Rather than selecting specific components of a coefficient vector, we aim to select specific rows of a coefficient matrix. We thus also refer to the union support problem as the row selection problem. Note finally that recovering $S$ is not equivalent to recovering each of the individual supports $S_k$.

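If computational complexity were not a concern, the natural way to perform row selection for $B^*$ would be by solving the optimization problem

$$\arg\min_{B \in \mathbb{R}^{p \times K}} \left\{ \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \|B\|_{\ell_0/\ell_q} \right\}, \tag{6}$$

where $B = (\beta_{ik})_{1 \le i \le p,\, 1 \le k \le K}$ is a $p \times K$ matrix, the quantity $|||\cdot|||_F$ denotes the Frobenius norm^1, and the "norm" $\|B\|_{\ell_0/\ell_q}$ counts the number of rows in $B$ that have non-zero $\ell_q$ norm.

As before, the $\ell_0$ component of this regularizer yields a non-convex and computationally intractable problem, so that it is natural to consider the relaxation

$$\arg\min_{B \in \mathbb{R}^{p \times K}} \left\{ \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \|B\|_{\ell_1/\ell_q} \right\}, \tag{7}$$

where $\|B\|_{\ell_1/\ell_q}$ is the block $\ell_1/\ell_q$ norm:

$$\|B\|_{\ell_1/\ell_q} := \sum_{i=1}^{p} \left( \sum_{k=1}^{K} |\beta_{ik}|^q \right)^{1/q} = \sum_{i=1}^{p} \|\beta_i\|_q. \tag{8}$$

^1 The Frobenius norm of a matrix $A$ is given by $|||A|||_F := \sqrt{\sum_{i,j} A_{ij}^2}$.
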
The relaxation (7) is a natural generalization of the Lasso; indeed, it specializes to the Lasso in the case $K = 1$. For later reference, we also note that setting $q = 1$ leads to the use of the $\ell_1/\ell_1$ block-norm in the relaxation (7). Since this norm decouples across both the rows and columns, this particular choice is equivalent to solving $K$ separate Lasso problems, one for each column of the $p \times K$ regression matrix $B^*$. A more interesting choice is $q = 2$, which yields a block $\ell_1/\ell_2$ norm that couples together the columns of $B$. This regularization is commonly referred to as the group Lasso. As we discuss in the appendix, the group Lasso with $q = 2$ can be cast as a second-order cone program (SOCP), a family of optimization problems that can be solved efficiently with interior point methods (Boyd and Vandenberghe, 2004), and includes quadratic programs as a particular case.
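To make the block norm in (8) concrete, here is a minimal Python sketch that sums the $\ell_q$ norms of the rows of a coefficient matrix. The helper name `block_norm` and the toy matrix are illustrative assumptions, not part of the paper.

```python
from typing import Sequence


def block_norm(B: Sequence[Sequence[float]], q: float = 2.0) -> float:
    """l1/lq block norm: the sum over rows of the lq norm of each row.

    Hypothetical helper for illustration; q = 1 decouples rows and columns,
    q = 2 gives the group Lasso penalty, q = inf gives the l1/l_inf penalty.
    """
    total = 0.0
    for row in B:
        if q == float("inf"):
            total += max(abs(v) for v in row)
        else:
            total += sum(abs(v) ** q for v in row) ** (1.0 / q)
    return total


# Toy 3 x 2 coefficient matrix with one all-zero row.
B = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]

print(block_norm(B, q=1))  # 8.0: plain sum of absolute values
print(block_norm(B, q=2))  # 6.0: 5 + 0 + 1, rows enter through their Euclidean norms
```

With $q = 1$ the penalty treats every entry separately, whereas with $q = 2$ a row contributes only through its Euclidean norm, which is what couples the $K$ columns.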



Some recent work has addressed certain statistical aspects of block-regularization schemes. Meier et al. (2008) have performed an analysis of risk consistency with block-norm regularization. Bach (2008) provides an analysis of block-wise support recovery for the kernelized group-Lasso in the classical, fixed $p$ setting. In the high-dimensional setting, Ravikumar et al. (2008) have studied the consistency of block-wise support recovery for the group-Lasso ($\ell_1/\ell_2$) for fixed design matrices, and their result is generalized by Liu and Zhang (2008) to block-wise support recovery in the setting of general $\ell_1/\ell_q$ regularization, again for fixed design matrices. However, these analyses do not discriminate between various values of $q$, yielding the same qualitative results and the same convergence rates for $q = 1$ as for $q > 1$. Our focus, which is motivated by the empirical observation that the group Lasso can outperform the ordinary Lasso (Bach, 2008; Yuan and Lin, 2006; Zhao et al., 2007; Obozinski et al., 2007), is precisely the distinction between $q = 1$ and $q > 1$ (specifically $q = 2$).

The distinction between $q = 1$ and $q = 2$ is also significant from an optimization-theoretic point of view. In particular, the SOCP relaxations underlying the group Lasso ($q = 2$) are generally tighter than the quadratic programming relaxation underlying the Lasso ($q = 1$); however, the improved accuracy is generally obtained at a higher computational cost (Boyd and Vandenberghe, 2004). Thus we can view our problem as an instance of the general question of the relationship of statistical efficiency to computational efficiency: does the qualitatively greater amount of computational effort involved in solving the group Lasso always yield greater statistical efficiency? More specifically, can we give theoretical conditions under which solving the generalized Lasso problem (7) has greater statistical efficiency than naive strategies based on the ordinary Lasso? Conversely, can the group Lasso ever be worse than the ordinary Lasso?

With this motivation, this paper provides a detailed analysis of model selection consistency of the group Lasso (7) with $\ell_1/\ell_2$-regularization. Statistical efficiency is defined in terms of the scaling of the sample size $n$, as a function of the problem size $p$ and sparsity structure of the regression matrix $B^*$, required for consistent row selection. Our analysis is high-dimensional in nature, allowing both $n$ and $p$ to diverge, and yielding explicit error bounds as a function of $p$. As detailed below, our analysis provides affirmative answers to both of the questions above. First, we demonstrate that under certain structural assumptions on the design and regression matrix $B^*$, the group $\ell_1/\ell_2$-Lasso is always guaranteed to outperform the ordinary Lasso, in that it correctly performs row selection for sample sizes for which the Lasso fails with high probability. Second, we also exhibit some problems (though arguably not generic) for which the group Lasso will be outperformed by the naive strategy of applying the Lasso separately to each of the $K$ columns, and taking the union of supports.



1.2 Our results 

The main contribution of this paper is to show that under certain technical conditions on the design and noise matrices, the model selection performance of block-regularized $\ell_1/\ell_2$ regression (7) is governed by the sample complexity function

$$\theta_{\ell_1/\ell_2}(n, p; B^*) := \frac{n}{2\,\psi(B^*) \log(p - s)}, \tag{9}$$

where $n$ is the sample size, $p$ is the ambient dimension, $s = |S|$ is the number of rows that are non-zero, and $\psi(\cdot)$ is a sparsity-overlap function. Our use of the term "sample complexity" for $\theta_{\ell_1/\ell_2}$ reflects the role it plays in our analysis as the rate at which the sample size must grow in order to obtain consistent row selection as a function of the problem parameters. More precisely, for scalings $(n, p, s, B^*)$ such that $\theta_{\ell_1/\ell_2}(n, p; B^*)$ exceeds a fixed critical threshold $t^* \in (0, +\infty)$, we show the probability of correct row selection by $\ell_1/\ell_2$ group Lasso converges to one.

Whereas the ratio $n/\log p$ is standard for high-dimensional theory on $\ell_1$-regularization, the function $\psi(B^*)$ is a novel and interesting quantity, which measures both the sparsity of the matrix $B^*$, as well as the overlap between the different regression tasks, represented by the columns of $B^*$. (See equation (15) for the precise definition of $\psi(B^*)$.) As a particular illustration, consider the special case of a single-task or univariate regression with $K = 1$, in which the convex program (7) reduces to the ordinary Lasso (3). In this case, if the design matrix is drawn from the standard Gaussian ensemble (i.e., $X_{ij} \sim N(0, 1)$, i.i.d.), we show that the sparsity-overlap function reduces to $\psi(B^*) = s$, corresponding to the support size of the single coefficient vector. We thus recover as a corollary a previously known result (Wainwright, 2006): namely, the Lasso succeeds in performing exact support recovery once the ratio $n/[s \log(p - s)]$ exceeds a certain critical threshold. At the other extreme, for a genuinely multivariate problem with $K > 1$ and $s$ non-zero rows, again for a standard Gaussian design, when the regression matrix is "suitably orthonormal" relative to the design (see Section 2 for a precise definition), the sparsity-overlap function is given by $\psi(B^*) = s/K$. In this case, $\ell_1/\ell_2$ block-regularization has sample complexity lower by a factor of $K$ relative to the naive approach of solving $K$ separate Lasso problems. Of course, there is also a range of behavior between these two extremes, in which the gain in sample complexity varies smoothly as a function of the sparsity-overlap $\psi(B^*)$ in the interval $[s/K, s]$. On the other hand, we also show that for suitably correlated designs, it is possible that the sample complexity $\psi(B^*)$ associated with $\ell_1/\ell_2$ row selection is larger than that of the ordinary Lasso ($\ell_1/\ell_1$) approach.

The remainder of the paper is organized as follows. In Section 2, we provide a precise statement of our main result, discuss some of its consequences, and illustrate the close agreement between our theory and simulations. Section 3 is devoted to the proof of this main result, with the argument broken down into a series of steps. Technical results are deferred to the appendix. We conclude with a brief discussion in Section 4.

1.3 Notation

We collect here some notation used throughout the paper. For a (possibly random) matrix $M \in \mathbb{R}^{p \times K}$, we define the Frobenius norm $|||M|||_F := \sqrt{\sum_{i,j} m_{ij}^2}$, and for parameters $1 \le a \le b \le \infty$, the $\ell_a/\ell_b$ block norm

$$\|M\|_{\ell_a/\ell_b} := \left\{ \sum_{i=1}^{p} \left( \sum_{k=1}^{K} |m_{ik}|^b \right)^{a/b} \right\}^{1/a}. \tag{10}$$

These vector norms on matrices should be distinguished from the $(a, b)$-operator norms

$$|||M|||_{a,b} := \sup_{\|x\|_b = 1} \|Mx\|_a, \tag{11}$$

although some norms belong to both families; see Lemma 5 in Appendix B. Important special cases of the latter include the spectral norm $|||M|||_{2,2}$ (also denoted $|||M|||_2$), and the $\ell_\infty$-operator norm $|||M|||_{\infty,\infty} = \max_{i=1,\ldots,p} \sum_{j=1}^{K} |M_{ij}|$, denoted $|||M|||_\infty$ for short.



2 Main result and some consequences 

The analysis of this paper applies to random ensembles of multivariate linear regression problems, each of the form (4), where the noise matrix $W \in \mathbb{R}^{n \times K}$ is assumed to consist of i.i.d. elements $W_{ij} \sim N(0, \sigma^2)$. We consider random design matrices $X$ with each row drawn in an i.i.d. manner from a zero-mean Gaussian $N(0, \Sigma)$, where $\Sigma \succ 0$ is a $p \times p$ covariance matrix. We note in passing that analogs of our results with different constants apply to any design with sub-Gaussian rows^2. Although the block-regularized problem (7) need not have a unique solution in general, a consequence of our analysis is that in the regime of interest, the solution is unique, so that we may talk unambiguously about the estimated support $\hat{S}$. The main object of study in this paper is the probability $\mathbb{P}[\hat{S} = S]$, where the probability is taken both over the random choice of noise matrix $W$ and random design matrix $X$. We study the behavior of this probability as elements of the triplet $(n, p, s)$ tend to infinity.



2.1 Notation and assumptions 

More precisely, our main result applies to sequences of models indexed by $(n, p(n), s(n))$, an associated sequence of $p \times p$ covariance matrices, and a sequence $\{B^*\}$ of coefficient matrices with row support

$$S := \{ i \mid \beta_i^* \neq 0 \} \tag{12}$$

of size $|S| = s = s(n)$. We use $S^c$ to denote its complement (i.e., $S^c := \{1, \ldots, p\} \setminus S$). We let

$$b^*_{\min} := \min_{i \in S} \|\beta_i^*\|_2 \tag{13}$$

correspond to the minimal $\ell_2$ row-norm of the coefficient matrix $B^*$ over its non-zero rows. We impose the following conditions on the covariance $\Sigma$ of the design matrix:



^2 See Buldygin and Kozachenko (2000) for more details on sub-Gaussian random vectors.

(A1) Bounded eigenspectrum: There exist fixed constants $C_{\min} > 0$ and $C_{\max} < +\infty$ such that all eigenvalues of the $s \times s$ matrix $\Sigma_{SS}$ are contained in the interval $[C_{\min}, C_{\max}]$.

(A2) Mutual incoherence: There exists a fixed incoherence parameter $\gamma \in (0, 1]$ such that

$$|||\Sigma_{S^c S} (\Sigma_{SS})^{-1}|||_\infty \le 1 - \gamma.$$

(A3) Self-incoherence: There exists $D_{\max} < +\infty$ such that

$$|||(\Sigma_{SS})^{-1}|||_\infty \le D_{\max}.$$



Assumption A1 prevents excess dependence among elements of the design matrix associated with the support $S$; conditions of this form are required for model selection consistency or $\ell_2$ consistency of the Lasso. The mutual incoherence and self-incoherence assumptions are also well known from previous work on variable selection consistency of the Lasso (Meinshausen and Bühlmann, 2006; Tropp, 2006; Zhao and Yu, 2006). Although such incoherence assumptions are not needed in analyzing $\ell_2$ or risk consistency, they are known to be necessary for model selection consistency of the Lasso. Indeed, in the absence of such conditions, it is always possible to make the Lasso fail, even with an arbitrarily large sample size. (However, see Meinshausen and Yu (2008) for methods that weaken the incoherence condition.) Note that these assumptions are trivially satisfied by the standard Gaussian ensemble $\Sigma = I_{p \times p}$, with $C_{\min} = C_{\max} = 1$, $D_{\max} = 1$, and $\gamma = 1$. More generally, it can be shown that various matrix classes (e.g., Toeplitz matrices, tree-structured covariance matrices, bounded off-diagonal matrices) satisfy these conditions (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006; Wainwright, 2006).



2.2 Statement of main result 

We require a few pieces of notation before stating the main result. For an arbitrary matrix $B_S \in \mathbb{R}^{s \times K}$ with $i$th row $\beta_i \in \mathbb{R}^{1 \times K}$, we define the matrix $\zeta(B_S) \in \mathbb{R}^{s \times K}$ with $i$th row

$$\zeta(\beta_i) := \frac{\beta_i}{\|\beta_i\|_2}. \tag{14}$$

With this notation, the sparsity-overlap function is given by

$$\psi(B) := |||\zeta(B_S)^T (\Sigma_{SS})^{-1} \zeta(B_S)|||_2, \tag{15}$$

where $|||\cdot|||_2$ denotes the spectral norm. Finally, the sample complexity function is given by

$$\theta_{\ell_1/\ell_2}(n, p; B^*) := \frac{n}{2\,\psi(B^*) \log(p - s)}. \tag{16}$$

With this setup, we have the following result:

Theorem 1. Consider a random design matrix $X$ drawn with i.i.d. $N(0, \Sigma)$ row vectors, where $\Sigma$ satisfies assumptions A1 through A3, and an observation matrix $Y$ specified by model (4). Suppose that the squared minimum value $(b^*_{\min})^2$ decays no more slowly than $f(p) \min\{\frac{1}{s}, \frac{1}{\log(p - s)}\}$ for some function $f(p)$ such that $f(p)/s \to 0$ and $f(p) \to +\infty$. Then for all sequences $(n, p, B^*)$ such that

$$\theta_{\ell_1/\ell_2}(n, p; B^*) = \frac{n}{2\,\psi(B^*) \log(p - s)} > t^*, \tag{17}$$

we have with probability greater than $1 - c_1 \exp(-c_2 \log s)$:



(a) the SOCP with $\lambda_n = \sqrt{f(p) \log(p)/n}$ has a unique solution $\hat{B}$, and

(b) the row support set

$$\hat{S} = S(\hat{B}) := \{ i \mid \hat\beta_i \neq 0 \} \tag{18}$$

specified by this unique solution is equal to the row support set $S(B^*)$ of the true model.
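As an illustration of definitions (14) to (16), the following Python sketch evaluates the sparsity-overlap function and the sample complexity function in the special case $\Sigma_{SS} = I_{s \times s}$, where $\psi(B)$ reduces to the largest eigenvalue of $\zeta(B_S)^T \zeta(B_S)$. The function names, the power-iteration routine, and the toy matrices are assumptions made for illustration and do not come from the paper; a general covariance would additionally require applying $(\Sigma_{SS})^{-1}$.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def normalize_rows(B_S: Matrix) -> List[List[float]]:
    """zeta(B_S): rescale every (non-zero) row to unit Euclidean norm."""
    Z = []
    for row in B_S:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0.0:
            raise ValueError("rows of B_S must be non-zero")
        Z.append([v / norm for v in row])
    return Z


def top_eigenvalue_psd(G: Matrix, iters: int = 1000) -> float:
    """Largest eigenvalue of a symmetric PSD matrix by power iteration.

    Assumes the fixed start vector is not orthogonal to the top eigenspace.
    """
    K = len(G)
    x = [1.0 + 0.1 * a for a in range(K)]
    for _ in range(iters):
        y = [sum(G[a][b] * x[b] for b in range(K)) for a in range(K)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in y]
    Gx = [sum(G[a][b] * x[b] for b in range(K)) for a in range(K)]
    return sum(x[a] * Gx[a] for a in range(K))  # Rayleigh quotient, x has unit norm


def sparsity_overlap_identity(B_S: Matrix) -> float:
    """psi(B) under the assumption Sigma_SS = I: spectral norm of Z^T Z."""
    Z = normalize_rows(B_S)
    K = len(Z[0])
    G = [[sum(row[a] * row[b] for row in Z) for b in range(K)] for a in range(K)]
    return top_eigenvalue_psd(G)


def sample_complexity(n: int, p: int, s: int, psi: float) -> float:
    """theta = n / (2 psi log(p - s)), as in (16)."""
    return n / (2.0 * psi * math.log(p - s))


# Toy supports with s = 4 rows and K = 2 columns.
B_same = [[1.0, 1.0], [-2.0, -2.0], [3.0, 3.0], [0.5, 0.5]]      # identical columns
B_disjoint = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, -1.0]]   # orthogonal columns

psi_same = sparsity_overlap_identity(B_same)
psi_disjoint = sparsity_overlap_identity(B_disjoint)
print(psi_same, psi_disjoint)
print(sample_complexity(n=200, p=64, s=4, psi=psi_disjoint))
```

In this toy setting the first matrix gives $\psi$ close to $s = 4$ and the second gives $\psi$ close to $s/K = 2$, the two extremes of the range permitted when the design is uncorrelated on the support.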

2.3 Some consequences of Theorem 1

We begin by making some simple observations about the sparsity overlap function.

Lemma 1. (a) For any design satisfying assumption A1, the sparsity-overlap $\psi(B^*)$ obeys the bounds

$$\frac{s}{K\,C_{\max}} \le \psi(B^*) \le \frac{s}{C_{\min}}. \tag{19}$$

(b) If $\Sigma_{SS} = I_{s \times s}$, and if the columns $(Z^{(k)*})$ of the matrix $Z^* = \zeta(B^*)$ are orthogonal, then the sparsity overlap function is $\psi(B^*) = \max_{k=1,\ldots,K} \|Z^{(k)*}\|^2$.

Proof. (a) To verify this claim, we first set $Z_S^* = \zeta(B_S^*)$, and use $Z_S^{(k)*}$ to denote the $k$th column of $Z_S^*$. Since the spectral norm is upper bounded by the sum of eigenvalues, and lower bounded by the average eigenvalue, we have

$$\frac{1}{K} \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*) \le \psi(B^*) \le \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*).$$

Given our assumption (A1) on $\Sigma_{SS}$, we have

$$\operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*) = \sum_{k=1}^{K} Z_S^{(k)*T} \Sigma_{SS}^{-1} Z_S^{(k)*} \ge \frac{1}{C_{\max}} \sum_{k=1}^{K} \|Z_S^{(k)*}\|^2 = \frac{s}{C_{\max}},$$

using the fact that $\sum_{k=1}^{K} \|Z_S^{(k)*}\|^2 = \sum_{i=1}^{s} \|\zeta(\beta_i^*)\|^2 = s$. Similarly, in the other direction, we have

$$\operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*) = \sum_{k=1}^{K} Z_S^{(k)*T} \Sigma_{SS}^{-1} Z_S^{(k)*} \le \frac{1}{C_{\min}} \sum_{k=1}^{K} \|Z_S^{(k)*}\|^2 = \frac{s}{C_{\min}},$$

which completes the proof.

(b) Under the assumed orthogonality, the matrix $Z^{*T} Z^*$ is diagonal with $\|Z^{(k)*}\|^2$ as the diagonal elements, so that the largest $\|Z^{(k)*}\|^2$ is then the largest eigenvalue of the matrix. □

Based on this lemma, we now study some special cases of Theorem 1. The simplest special case is the univariate regression problem ($K = 1$), in which case the quantity $\zeta(\beta^*)$ (as defined in equation (14)) simply outputs an $s$-dimensional sign vector with elements $z_i^* = \operatorname{sign}(\beta_i^*)$. (Recall that the sign function is defined as $\operatorname{sign}(0) = 0$, $\operatorname{sign}(x) = 1$ if $x > 0$ and $\operatorname{sign}(x) = -1$ if $x < 0$.) In this case, the sparsity overlap function is given by $\psi(\beta^*) = z^{*T} (\Sigma_{SS})^{-1} z^*$, and as a consequence of Lemma 1(a), we have $\psi(\beta^*) = \Theta(s)$.

Consequently, a simple corollary of Theorem 1 is that the Lasso succeeds once the ratio $n/(2s \log(p - s))$ exceeds a certain critical threshold, determined by the eigenspectrum and incoherence properties of $\Sigma$. This result matches the necessary and sufficient conditions established in previous work on the Lasso (Wainwright, 2006).



We can also use Lemma 1 and Theorem 1 to compare the performance of the group Lasso to the following (arguably naive) strategy for row selection using the ordinary Lasso:

Row selection using ordinary Lasso:

1. Apply the ordinary Lasso separately to each of the $K$ univariate regression problems specified by the columns of $B^*$, thereby obtaining estimates $\hat\beta^{(k)}$ for $k = 1, \ldots, K$.

2. For $k = 1, \ldots, K$, estimate the column support via $\hat{S}_k := \{ i \mid \hat\beta_i^{(k)} \neq 0 \}$.

3. Estimate the row support by taking the union: $\hat{S} = \bigcup_{k=1}^{K} \hat{S}_k$.
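Steps 2 and 3 of this procedure can be sketched in a few lines of Python. The sketch assumes that step 1 has already been carried out by some Lasso solver and that its output is available as a list of coefficient vectors; the names `column_support` and `union_support`, the tolerance argument, and the toy estimates are hypothetical.

```python
from typing import List, Sequence, Set


def column_support(beta_hat: Sequence[float], tol: float = 0.0) -> Set[int]:
    """Step 2: indices i whose estimated coefficient is non-zero (beyond tol)."""
    return {i for i, v in enumerate(beta_hat) if abs(v) > tol}


def union_support(column_estimates: List[Sequence[float]], tol: float = 0.0) -> Set[int]:
    """Step 3: union of the K estimated column supports."""
    support: Set[int] = set()
    for beta_hat in column_estimates:
        support |= column_support(beta_hat, tol)
    return support


# Toy output of K = 2 separate Lasso fits with p = 4 covariates (made-up numbers).
estimates = [
    [0.0, 1.2, 0.0, -0.3],
    [0.5, 0.0, 0.0, 0.1],
]
print(sorted(union_support(estimates)))  # [0, 1, 3]
```

A row enters the estimated union as soon as a single column selects it, which is why one false inclusion in any of the $K$ fits is enough for the procedure to fail at exact row selection.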

To understand the conditions governing the success/failure of this procedure, note that it succeeds if and only if for each non-zero row $i \in S = \bigcup_{k=1}^{K} S_k$, the variable $\hat\beta_i^{(k)}$ is non-zero for at least one $k$, and for all $j \in S^c = \{1, \ldots, p\} \setminus S$, the variable $\hat\beta_j^{(k)} = 0$ for all $k = 1, \ldots, K$. From our understanding of the univariate case, we know that for $C = 2t^*(\Sigma)$, the condition

$$n \ge C \max_{k=1,\ldots,K} \psi(\beta_S^{(k)*}) \log(p - s_k) \ge C \max_{k=1,\ldots,K} \psi(\beta_S^{(k)*}) \log(p - s) \tag{20}$$

is sufficient for the ordinary Lasso to succeed in row selection. Conversely, if $n < \max_{k=1,\ldots,K} \psi(\beta_S^{(k)*}) \log(p - s)$, then there will exist some $j \in S^c$ such that for at least one $k \in \{1, \ldots, K\}$, there holds $\hat\beta_j^{(k)} \neq 0$ with high probability, implying failure of the ordinary Lasso.

A natural question is whether the group Lasso, by taking into account the couplings across columns, always outperforms (or at least matches) this naive strategy. The following result shows that if the design is uncorrelated on its support, then indeed this is the case.

Corollary 1 (Group Lasso versus ordinary Lasso). Assume that $\Sigma_{SS} = I_{s \times s}$. Then for any multivariate regression problem, row selection using the ordinary Lasso strategy requires, with high probability, at least as many samples as the $\ell_1/\ell_2$ group Lasso. In particular, the relative efficiency of group Lasso versus ordinary Lasso is given by the ratio

$$\frac{\max_{k=1,\ldots,K} \psi(\beta_S^{(k)*}) \log(p - s_k)}{\psi(B_S^*) \log(p - s)} \ge 1. \tag{21}$$

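Proof. From our discussion preceding the statement of Corollary 1, we know that the quantity

$$\max_{k=1,\ldots,K} \psi(\beta_S^{(k)*}) \log(p - s_k) = \max_{k=1,\ldots,K} s_k \log(p - s_k) \ge \max_{k=1,\ldots,K} s_k \log(p - s)$$

governs the performance of the ordinary Lasso procedure for row selection. It remains to show then that $\psi(B_S^*) \le \max_k s_k$.

As before, we use the notation $Z_S^* = \zeta(B_S^*)$, and $Z_i^*$ for the $i$th row of $Z_S^*$. Since $\Sigma_{SS} = I_{s \times s}$, we have $\psi(B^*) = |||Z_S^*|||_2^2$. Consequently, by the variational representation of the $\ell_2$-norm, we have

$$\psi(B^*) = \max_{x \in \mathbb{R}^K : \|x\| \le 1} \|Z_S^* x\|^2 = \max_{x \in \mathbb{R}^K : \|x\| \le 1} \sum_{i=1}^{s} (Z_i^{*T} x)^2.$$

Let $|Z_i^*| = (|Z_{i1}^*|, \ldots, |Z_{iK}^*|)^T$ and $y_i = (x_1 \operatorname{sign}(Z_{i1}^*), \ldots, x_K \operatorname{sign}(Z_{iK}^*))^T$. By the Cauchy-Schwarz inequality,

$$(Z_i^{*T} x)^2 = (|Z_i^*|^T y_i)^2 \le \| |Z_i^*| \|^2 \, \|y_i\|^2 = \|Z_i^*\|^2 \sum_{k=1}^{K} x_k^2 \operatorname{sign}(Z_{ik}^*)^2,$$

so that, since each row satisfies $\|Z_i^*\| = 1$, if $\|x\| \le 1$, we have

$$\sum_{i=1}^{s} (Z_i^{*T} x)^2 \le \sum_{i=1}^{s} \sum_{k=1}^{K} x_k^2 \operatorname{sign}(Z_{ik}^*)^2 = \sum_{k=1}^{K} x_k^2 \sum_{i=1}^{s} \operatorname{sign}(Z_{ik}^*)^2 = \sum_{k=1}^{K} x_k^2 s_k \le \max_{k=1,\ldots,K} s_k,$$

thereby establishing the claim. □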
We illustrate Corollary 1 by considering some special cases:

Example 1 (Identical regressions). Suppose that $B^* := \beta^* \mathbf{1}_K^T$; that is, $B^*$ consists of $K$ copies of the same coefficient vector $\beta^* \in \mathbb{R}^p$, with support of cardinality $|S| = s$. We then have $[\zeta(B^*)]_{ij} = \operatorname{sign}(\beta_i^*)/\sqrt{K}$, from which we see that $\psi(B^*) = z^{*T} (\Sigma_{SS})^{-1} z^*$, with $z^*$ being an $s$-dimensional sign vector with elements $z_i^* = \operatorname{sign}(\beta_i^*)$. Consequently, we have the equality $\psi(B^*) = \psi(\beta^{(1)*})$, so that there is no benefit in using the group Lasso relative to the strategy of solving separate Lasso problems and constructing the union of individually estimated supports. This fact might seem rather pessimistic, since under model (4), we essentially have $Kn$ observations of the coefficient vector $\beta^*$ with the same design matrix but $K$ independent noise realizations. However, under the given conditions, the rates of convergence for model selection in high-dimensional results such as Theorem 1 are determined by the number of interfering variables, $p - s$, as opposed to the noise variance.

In contrast to this pessimistic example, we now turn to the most optimistic extreme:

Example 2 ("Orthonormal" regressions). Suppose that $\Sigma_{SS} = I_{s \times s}$ and (for $s > K$) suppose that $B^*$ is constructed such that the columns of the $s \times K$ matrix $\zeta(B^*)$ are all orthonormal. Under these conditions, we claim that the sample complexity of group Lasso is lower than that of the ordinary Lasso by a factor of $1/K$. Indeed, we observe that

$$K \psi(B^*) = K \|Z^{(1)*}\|^2 = \sum_{k=1}^{K} \|Z^{(k)*}\|^2 = \operatorname{tr}(Z^{*T} Z^*) = \operatorname{tr}(Z^* Z^{*T}) = s,$$

because $Z^* Z^{*T} \in \mathbb{R}^{s \times s}$ is the Gram matrix of $s$ unit vectors of $\mathbb{R}^K$ and its diagonal elements are therefore all equal to 1. Consequently, the group Lasso recovers the row support with high probability for sequences such that

$$\frac{n}{2 \frac{s}{K} \log(p - s)} > 1,$$

which allows for sample sizes $1/K$ smaller than the ordinary Lasso approach.

Corollary 1 and the subsequent examples address the case of uncorrelated design ($\Sigma_{SS} = I_{s \times s}$) on the row support $S$, for which the group Lasso is never worse than the ordinary Lasso in performing row selection. The following example shows that if the supports are disjoint, the ordinary Lasso has the same sample complexity as the group Lasso for uncorrelated design $\Sigma_{SS} = I_{s \times s}$, but can be better than the group Lasso for designs $\Sigma_{SS}$ with suitable correlations:

Corollary 2 (Disjoint supports). Suppose that the support sets $S_k$ of individual regression problems are all disjoint. Then for any design covariance $\Sigma_{SS}$, we have

$$\max_{k=1,\ldots,K} \psi(\beta^{(k)*}) \overset{(a)}{\le} \psi(B^*) \overset{(b)}{\le} \sum_{k=1}^{K} \psi(\beta^{(k)*}). \tag{22}$$

Proof. First note that, since all supports are disjoint, $Z_i^{(k)*} = \operatorname{sign}(\beta_i^{(k)*})$, so that $Z_S^{(k)*} = \zeta(\beta^{(k)*})$. Inequality (b) is then immediate, since $|||Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*|||_2 \le \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*)$. To establish inequality (a), we note that

$$\psi(B^*) = \max_{x \in \mathbb{R}^K : \|x\| \le 1} x^T Z_S^{*T} \Sigma_{SS}^{-1} Z_S^* x \ge \max_{1 \le k \le K} e_k^T Z_S^{*T} \Sigma_{SS}^{-1} Z_S^* e_k = \max_{1 \le k \le K} Z_S^{(k)*T} \Sigma_{SS}^{-1} Z_S^{(k)*}.$$

□

We illustrate Corollary 2 with an example.

Example 3 (Disjoint support with uncorrelated design). Suppose that $\Sigma_{SS} = I_{s \times s}$, and the supports are disjoint. In this case, we claim that the sample complexity of the $\ell_1/\ell_2$ group Lasso is the same as the ordinary Lasso. If the individual regressions have disjoint support, then $Z_S^* = \zeta(B_S^*)$ has only a single non-zero entry per row and therefore the columns of $Z^*$ are orthogonal. Moreover, $Z_{ik}^* = \operatorname{sign}(\beta_i^{(k)*})$. By Lemma 1(b), the sparsity-overlap function $\psi(B^*)$ is equal to the largest squared column norm. But $\|Z^{(k)*}\|^2 = \sum_{i=1}^{s} \operatorname{sign}(\beta_i^{(k)*})^2 = s_k$. Thus, the sample complexity of the group Lasso is the same as the ordinary Lasso in this case^3.

^3 In making this assertion, we are ignoring any difference between $\log(p - s_k)$ and $\log(p - s)$, which is valid, for instance, in the regime of sublinear sparsity, when $s_k/p \to 0$ and $s/p \to 0$.

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Finally, we consider an example that illustrates the effect of correlated designs: 



Example 4. Effects of correlated designs To illustrate the behavior of the sparsity- 
overlap function in the presence of correlations in the design, we consider the simple case of 
two regressions with support of size 2. For parameters i?! and G [0, vr] and p £ (—1, +1), 
consider regression matrices B* such that B* = Ci^s) ^^'^ 



cos(t?i) sin(i?i) 
cos(t?2) sin(i?2) 



and 



^ss 



aB*s) 

Setting M* = C{Bg)'^'E^gC{B'^) , a simple calculation shows that 



1 P 
P 1 



(23) 



tr(M*) = 2(1 +pcos(??i-i?2)), and 
so that the eigenvalues of M* are 

p+ = (l + p)(l + cos(i9i-i92)), and 



det(M*) = (1 - p2)sin(19l-^92)^ 



/i- = (!-/>)(! -cos(i?i-i92)). 



so that ipiB*) = max(/i+, /u ). On the other hand, with 



C(/3(^) 



sign(cos(i9i)) 
sign(cos(i92)) 



and Z2 = CiP^^^*) 



sign(sin(t?i)) 
sign (sin ( 1^2 )) 



we have 



zTy-l 
Z2 l^gg 



Z2 



L{cos(tfi)^o} + l{cos(,92)7^o} + 2p sign(cos(??i) cos(i?2)), 
L{sin(tfi)^o} + l{sin(,?2)^o} +2/3 sign(sin(??i ) sin(i?2))- 



Figure 1 provides a graphical comparison of these sample complexity functions, with $\mu$
denoting the correlation parameter of the design. The function
$\tilde\psi(B^*) = \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ is discontinuous on
$\mathcal{S} = \frac{\pi}{2}\mathbb{Z} \times \mathbb{R} \,\cup\, \mathbb{R} \times \frac{\pi}{2}\mathbb{Z}$, and, as a
consequence, so is its difference with $\psi(B^*)$. Note that, for fixed $\vartheta_1$ or fixed $\vartheta_2$, some of these
discontinuities are removable discontinuities of the induced function on the other variable,
and these discontinuities therefore create needles, slits or flaps in the graph of the function $\tilde\psi$.
Denote by $\mathcal{R}^+$ (resp. $\mathcal{R}^-$) the set
$\mathcal{R}^+ = \{(\vartheta_1, \vartheta_2) \mid \min[\cos(\vartheta_1)\cos(\vartheta_2), \sin(\vartheta_1)\sin(\vartheta_2)] > 0\}$
(resp. $\mathcal{R}^- = \{(\vartheta_1, \vartheta_2) \mid \max[\cos(\vartheta_1)\cos(\vartheta_2), \sin(\vartheta_1)\sin(\vartheta_2)] < 0\}$), on which $\tilde\psi(B^*)$ reaches
its minimum value when $\mu > 0.5$ (resp. when $\mu < 0.5$) (see the middle and bottom center
plots in Figure 1). For $\mu = 0$, the top center graph illustrates that $\tilde\psi(B^*)$ is equal to 2
except for the cases of matrices $B^*_S$ with disjoint support, corresponding to the discrete set
$\mathcal{D} = \{(k\frac{\pi}{2}, (k \pm 1)\frac{\pi}{2}),\ k \in \mathbb{Z}\}$, for which it equals 1. The top rightmost graph illustrates that,
as shown in Corollary 1, the inequality always holds for an uncorrelated design. For $\mu > 0$,
the inequality $\psi(B^*) \le \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ is violated only on a subset of $\mathcal{S} \cup \mathcal{R}^-$; and
for $\mu < 0$, the inequality is symmetrically violated on a subset of $\mathcal{S} \cup \mathcal{R}^+$ (see Figure 1).
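
The closed-form eigenvalues of Example 4 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names (`example4_matrices`, `psi_group`, `psi_closed_form`, `psi_lasso_max`) are illustrative and not taken from the paper. It compares the largest eigenvalue of $M^*$ with $\max(\mu_+, \mu_-)$ and evaluates the per-column Lasso quantity for comparison.

```python
import numpy as np

def example4_matrices(theta1, theta2, mu):
    """Row-normalized B* and the matrix Sigma_SS^{-1} = [[1, mu], [mu, 1]] of Example 4."""
    Z = np.array([[np.cos(theta1), np.sin(theta1)],
                  [np.cos(theta2), np.sin(theta2)]])
    sigma_inv = np.array([[1.0, mu], [mu, 1.0]])
    return Z, sigma_inv

def psi_group(theta1, theta2, mu):
    """psi(B*): largest eigenvalue of M* = Z^T Sigma_SS^{-1} Z (computed numerically)."""
    Z, sigma_inv = example4_matrices(theta1, theta2, mu)
    return float(np.linalg.eigvalsh(Z.T @ sigma_inv @ Z).max())

def psi_closed_form(theta1, theta2, mu):
    """max(mu_+, mu_-) from the closed-form eigenvalues of M*."""
    c = np.cos(theta1 - theta2)
    return float(max((1 + mu) * (1 + c), (1 - mu) * (1 - c)))

def psi_lasso_max(theta1, theta2, mu, tol=1e-12):
    """max over the two columns of z^T Sigma_SS^{-1} z, with z the entrywise sign vector."""
    Z, sigma_inv = example4_matrices(theta1, theta2, mu)
    signs = np.where(np.abs(Z) < tol, 0.0, np.sign(Z))  # treat tiny entries as exact zeros
    return float(max(signs[:, k] @ sigma_inv @ signs[:, k] for k in range(2)))
```

Evaluating `psi_group` and `psi_lasso_max` on a grid of angles for a fixed correlation gives a rough numerical counterpart of the panels described in the figure caption.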

2.4 Illustrative simulations 

In this section, we provide the results of some simulations to illustrate the sharpness of
Theorem 1 and furthermore to ascertain how quickly the predicted behavior is observed
as elements of the triple $(n, p, s)$ grow in different regimes. We explore the case of two
regression tasks (i.e., $K = 2$) which share an identical support set $S$ with cardinality $|S| = s$
in Section 2.4.1, and consider a slightly more general case in Section 2.4.2.

Figure 1. Comparison of sparsity-overlap functions for $\ell_1/\ell_2$ and the Lasso. For the
pair $(\vartheta_1, \vartheta_2)$, we represent in each row of plots, corresponding respectively to $\mu = 0$
(top), 0.9 (middle) and −0.9 (bottom), from left to right, the quantities: $\psi(B^*)$ (left),
$\max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ (center) and $\max(0, \psi(B^*) - \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*})))$ (right). The
latter indicates when the inequality $\psi(B^*) \le \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ does not hold and by
how much it is violated.



2.4.1 Phase transition behavior 

This first set of experiments is designed to reveal the phase transition behavior predicted by
Theorem 1. The design matrix $X$ is sampled from the standard Gaussian ensemble, with
i.i.d. entries $X_{ij} \sim N(0, 1)$. We consider two types of sparsity,

• logarithmic sparsity, where $s = \alpha \log(p)$, for $\alpha = 2/\log(2)$, and

• linear sparsity, where $s = \alpha p$, for $\alpha = 1/8$,

for various ambient model dimensions $p \in \{16, 32, 64, 256, 512, 1024\}$. For a given triplet
$(n, p, s)$, we solve the block-regularized problem with the regularization parameter
$\lambda_n = \sqrt{\log(p - s)(\log s)/n}$. For each fixed $(p, s)$ pair, we measure the sample complexity in terms
of a parameter $\theta$, in particular letting $n = \theta s \log(p - s)$ for $\theta \in [0.25, 1.5]$.

We let the matrix $B^* \in \mathbb{R}^{p \times 2}$ of regression coefficients have entries $\beta^*_{ij}$ in $\{-1/\sqrt{2}, 1/\sqrt{2}\}$,
choosing the parameters to vary the angle between the two columns, thereby obtaining
various desired values of $\psi(B^*)$. Since $\Sigma = I_{p \times p}$ for the standard Gaussian ensemble,
the sparsity-overlap function $\psi(B^*)$ is simply the maximal eigenvalue of the Gram matrix
$\zeta(B^*_S)^T \zeta(B^*_S)$. Since $|\beta^*_{ij}| = 1/\sqrt{2}$ by construction, we are guaranteed that $B^*_S = \zeta(B^*_S)$,
that the minimum value $b^*_{\min} = 1$, and moreover that the columns of $\zeta(B^*_S)$ have the same
Euclidean norm.

Figure 2. Plots of support union recovery probability $P[\hat S = S]$ versus the control parameter
$\theta = n/[2 s \log(p - s)]$ for two different types of sparsity, linear sparsity in the left column
($s = p/8$) and logarithmic sparsity in the right column ($s = 2\log_2(p)$), using $\ell_1/\ell_2$
regularization in the first three rows to estimate the support, respectively in the three cases
of identical regressions, intermediate angles and orthonormal regressions. The fourth row
presents results for the Lasso in the case of identical regressions.

Figure 3. Plots of support recovery probability $P[\hat S = S]$ versus the control parameter
$\theta = n/[2 s \log(p - s)]$ for two different types of sparsity, logarithmic sparsity on top ($s = O(\log(p))$)
and linear sparsity on bottom ($s = \alpha p$), and for increasing values of $p$ from left to right
(recoverable panel titles: $p = 256$, $s = p/8 = 32$ and $p = 1024$, $s = p/8 = 128$).
The noise level is set at $\sigma = 0.1$. Each graph shows four curves (black, red, green, blue)
corresponding to the case of independent $\ell_1$ regularization, and, for $\ell_1/\ell_2$ regularization,
the cases of identical regression, intermediate angles, and "orthonormal" regressions. Note
how curves corresponding to the same case across different problem sizes $p$ all coincide, as
predicted by Theorem 1. Moreover, consistent with the theory, the curves for the identical
regression group reach $P[\hat S = S] \approx 0.50$ at $\theta \approx 1$, whereas the orthogonal regression group
reaches 50% success substantially earlier.

To construct parameter matrices $B^*$ that satisfy $|\beta^*_{ij}| = 1/\sqrt{2}$, we choose both $p$ and the
sparsity scalings so that the obtained values for $s$ are multiples of four. We then construct
the columns $Z^{(1)*}$ and $Z^{(2)*}$ of the matrix $B^* = \zeta(B^*)$ from copies of vectors of length four.
Denoting by $\otimes$ the usual matrix tensor product, we consider the following 4-vectors:

Identical regressions: We set $Z^{(1)*} = Z^{(2)*} = \frac{1}{\sqrt{2}} 1_s$, so that the sparsity-overlap function
is $\psi(B^*) = s$.

Orthonormal regressions: Here $B^*$ is constructed with $Z^{(1)*} \perp Z^{(2)*}$, so that $\psi(B^*) = \frac{s}{2}$,
the most favorable situation. In order to achieve this orthonormality, we set
$Z^{(1)*} = \frac{1}{\sqrt{2}} 1_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}} 1_{s/2} \otimes (1, -1)^T$.

Intermediate angles: In this intermediate case, the columns $Z^{(1)*}$ and $Z^{(2)*}$ are at a
60° angle, which leads to $\psi(B^*) = \frac{3}{4} s$. Specifically, we set $Z^{(1)*} = \frac{1}{\sqrt{2}} 1_s$ and
$Z^{(2)*} = \frac{1}{\sqrt{2}} 1_{s/4} \otimes (1, 1, 1, -1)^T$.
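
The three constructions can be sketched in a few lines. This is an illustrative sketch assuming NumPy; `build_support_block` and `sparsity_overlap_identity` are hypothetical names, and the pattern $(1, 1, 1, -1)$ used for the intermediate case is one choice consistent with the stated 60° angle rather than a detail confirmed by the text above. For an identity covariance, the sparsity-overlap function reduces to the largest eigenvalue of the Gram matrix of the row-normalized block.

```python
import numpy as np

# Length-four sign patterns for the second column (first column is all ones).
PATTERNS = {
    "identical": np.array([1.0, 1.0, 1.0, 1.0]),
    "orthonormal": np.array([1.0, -1.0, 1.0, -1.0]),
    "intermediate": np.array([1.0, 1.0, 1.0, -1.0]),  # assumed pattern giving a 60 degree angle
}

def build_support_block(s, case):
    """Return the s x 2 block B*_S with entries +-1/sqrt(2) for the requested case."""
    if s % 4 != 0:
        raise ValueError("s must be a multiple of four")
    z1 = np.ones(s) / np.sqrt(2.0)
    z2 = np.kron(np.ones(s // 4), PATTERNS[case]) / np.sqrt(2.0)
    return np.column_stack([z1, z2])

def sparsity_overlap_identity(B_S):
    """psi(B*) when Sigma_SS = I: largest eigenvalue of zeta(B_S)^T zeta(B_S)."""
    Z = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)  # row normalization zeta
    return float(np.linalg.eigvalsh(Z.T @ Z).max())
```

Calling `sparsity_overlap_identity(build_support_block(s, case))` for each case is a quick way to confirm the values of the sparsity-overlap function quoted for the three designs.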

Figure 2 shows plots of linear sparsity (left column) and logarithmic sparsity (right
column) for all three cases solved using the group $\ell_1/\ell_2$ relaxation (top three rows), as well
as the reference Lasso case for the case of identical regressions (bottom row). Each panel
plots the success probability $P[\hat S = S]$ versus the rescaled sample size $\theta = n/[2 s \log(p - s)]$.
Under this re-scaling, Theorem 1 predicts that the curves should align, and that the success
probability should transition to 1 once $\theta$ exceeds a critical threshold (dependent on the
type of ensemble). Note that for suitably large problem sizes ($p > 128$), the curves do align
in the predicted way, showing step-function behavior. Figure 3 plots data from the same
simulations in a different format. Here the top row corresponds to logarithmic sparsity, and
the bottom row to linear sparsity; each panel shows the four different choices for $B^*$, with
the problem size $p$ increasing from left to right. Note how in each panel the location of the
transition of $P[\hat S = S]$ to one shifts from right to left, as we move from the case of identical
regressions to intermediate angles to orthogonal regressions.



2.4.2 Empirical thresholds 



In this experiment, we aim at verifying more precisely the location of the $\ell_1/\ell_2$ threshold
as the regression vectors vary continuously from identical to orthonormal. We consider the
case of matrices $B^*$ of size $s \times 2$ for $s$ even. In Example 4 of Section 2.3, we characterized
the value of $\psi(B^*)$ if $B^*$ is a $2 \times 2$ matrix.

In order to generate a family of regression matrices with smoothly varying sparsity/overlap
function, consider the following $2 \times 2$ matrix:

$$B_1(\alpha) = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \cos(\frac{\pi}{4} + \alpha) & \sin(\frac{\pi}{4} + \alpha) \end{bmatrix}. \qquad (24)$$

Note that $\alpha$ is the angle between the two rows of $B_1(\alpha)$ in this setup. Note moreover that
the columns of $B_1(\alpha)$ have varying norm.

We use this base matrix to define the following family of regression matrices $B^*_S \in \mathbb{R}^{s \times 2}$:

$$\Big\{ B_{1s}(\alpha) = 1_{s/2} \otimes B_1(\alpha), \quad \alpha \in \big[0, \tfrac{\pi}{2}\big] \Big\}. \qquad (25)$$

For a design matrix drawn from the standard Gaussian ensemble, the analysis of Example 4
in Section 2.3 naturally extends to show that the sparsity/overlap function is
$\psi(B_{1s}(\alpha)) = \frac{s}{2}(1 + |\cos(\alpha)|)$. Moreover, as we vary $\alpha$ from $0$ to $\frac{\pi}{2}$, the two regressions vary
from identical to "orthonormal" and the sparsity/overlap function decreases from $s$ to $\frac{s}{2}$.
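
As a sanity check on the family (25), the sketch below (assuming NumPy; `base_matrix`, `family_matrix` and `psi_identity_design` are illustrative names, and the first row of the base matrix is taken to be $(1/\sqrt{2}, 1/\sqrt{2})$) builds the stacked matrix by a Kronecker product and evaluates the sparsity/overlap function for an identity covariance.

```python
import numpy as np

def base_matrix(alpha):
    """2 x 2 base matrix: rows at angles pi/4 and pi/4 + alpha (first row assumed unit norm)."""
    return np.array([
        [1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0)],
        [np.cos(np.pi / 4 + alpha), np.sin(np.pi / 4 + alpha)],
    ])

def family_matrix(s, alpha):
    """Stack s/2 copies of the base matrix, i.e. 1_{s/2} (Kronecker) B_1(alpha)."""
    if s % 2 != 0:
        raise ValueError("s must be even")
    return np.kron(np.ones((s // 2, 1)), base_matrix(alpha))

def psi_identity_design(B_S):
    """Largest eigenvalue of zeta(B_S)^T zeta(B_S), valid when Sigma_SS is the identity."""
    Z = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)
    return float(np.linalg.eigvalsh(Z.T @ Z).max())
```

Sweeping `alpha` over a grid in $[0, \pi/2]$ and comparing `psi_identity_design(family_matrix(s, alpha))` with $\frac{s}{2}(1 + |\cos\alpha|)$ reproduces the stated formula numerically.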

We fix the problem size $p = 2048$ and sparsity $s = 2\log_2(p) = 22$. For each value
of $\alpha \in [0, \frac{\pi}{2}]$, we generate a matrix from the specified family and angle. We then solve
the block-regularized problem with sample size $n = 2\theta s \log(p - s)$ for a range of $\theta$ in
$[0.25, 1.5]$; for each value of $\theta$, we repeat the experiment (generating a random design matrix
$X$ and noise matrix $W$ each time) over $T = 500$ trials. Based on these trials, we
then estimate the value $\theta_{50\%}$ for which the exact support is retrieved at least 50% of the
time. Since $\psi(B^*) = \frac{s}{2}(1 + |\cos(\alpha)|)$, theory predicts that if we plot $\theta_{50\%}$ versus $|\cos(\alpha)|$, it
should lie on or below the straight line $\frac{1 + |\cos(\alpha)|}{2}$. We also perform the same experiments
for row selection using the ordinary Lasso, and plot the resulting estimated thresholds on
the same axes.
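
One way to turn per-trial outcomes into an estimate of $\theta_{50\%}$ is linear interpolation between the two grid points that bracket a 50% success rate. The sketch below assumes NumPy; the interpolation rule and the names `estimate_theta50` and `predicted_threshold` are assumptions made for illustration, not the procedure stated in the text.

```python
import numpy as np

def estimate_theta50(thetas, successes, trials):
    """Estimate the smallest theta at which the empirical success rate reaches 0.5.

    thetas    : increasing grid of control-parameter values
    successes : number of exact support recoveries at each grid point
    trials    : number of trials per grid point
    Returns NaN if the success rate never reaches 0.5 on the grid.
    """
    thetas = np.asarray(thetas, dtype=float)
    rates = np.asarray(successes, dtype=float) / trials
    above = np.nonzero(rates >= 0.5)[0]
    if above.size == 0:
        return float("nan")
    k = int(above[0])
    if k == 0:
        return float(thetas[0])
    # rates[k-1] < 0.5 <= rates[k], so the denominator below is strictly positive.
    t0, t1, r0, r1 = thetas[k - 1], thetas[k], rates[k - 1], rates[k]
    return float(t0 + (0.5 - r0) * (t1 - t0) / (r1 - r0))

def predicted_threshold(alpha):
    """Straight-line prediction (1 + |cos(alpha)|) / 2 for the group Lasso threshold."""
    return 0.5 * (1.0 + abs(np.cos(alpha)))
```

Plotting `estimate_theta50` against `predicted_threshold` over a grid of angles gives the kind of comparison described above.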

The results are shown in Figure 4. Note first that the curve obtained for $\ell_1/\ell_2$ (blue
circles) coincides roughly with the theoretical prediction $\frac{1 + |\cos(\alpha)|}{2}$ (dashed diagonal)
as regressions vary from orthogonal to identical. Moreover, the estimated $\theta_{50\%}$ of the
ordinary Lasso remains above 0.9 for all values of $\alpha$, which is close to the theoretical value
of 1. However, the curve obtained is not constant, but is roughly sigmoidal with a first
plateau close to 1 for $\cos(\alpha) < 0.4$ and a second plateau close to 0.9 for $\cos(\alpha) > 0.5$. The
latter coincides with the empirical value of $\theta_{50\%}$ for the univariate Lasso for the first column
$\beta^{(1)*}$ (not shown). There are two reasons why the value of $\theta_{50\%}$ for the ordinary Lasso does
not match the prediction of the first-order asymptotics: first, for $\alpha = \frac{\pi}{4}$ (corresponding to
$\cos(\alpha) \approx 0.7$), the support of $\beta^{(1)*}$ is reduced by one half and therefore its sample complexity
is decreased in that region. Second, the supports recovered by individual Lassos for $\beta^{(1)*}$
and $\beta^{(2)*}$ vary from uncorrelated when $\alpha = \frac{\pi}{2}$ to identical when $\alpha = 0$. It is therefore not
surprising that the sample complexity is the same as a single univariate Lasso for $\cos(\alpha)$
large and higher for $\cos(\alpha)$ small, where independent estimates of the support are more
likely to include, by union, spurious covariates in the row support.

3 Proof of Theorem 1

In this section, we provide the proof of our main result. For the convenience of the reader,
we begin by recapitulating the notation to be used throughout the argument.

• the sets $S$ and $S^c$ are a partition of the set of columns of $X$, such that $|S| = s$ and
$|S^c| = p - s$.

• the design matrix is partitioned as $X = [X_S \; X_{S^c}]$, where $X_S \in \mathbb{R}^{n \times s}$ and
$X_{S^c} \in \mathbb{R}^{n \times (p - s)}$.

• the regression coefficient matrix is also partitioned as $B^* = \begin{bmatrix} B^*_S \\ B^*_{S^c} \end{bmatrix}$, with $B^*_S \in \mathbb{R}^{s \times K}$
and $B^*_{S^c} = 0 \in \mathbb{R}^{(p - s) \times K}$. We use $\beta^*_i$ to denote the $i$th row of $B^*$.

• the regression model is given by $Y = XB^* + W$, where the noise matrix $W \in \mathbb{R}^{n \times K}$
has i.i.d. $N(0, \sigma^2)$ entries.

• the matrix $Z^*_S = \zeta(B^*_S) \in \mathbb{R}^{s \times K}$ has rows $Z^*_i = \zeta(\beta^*_i) = \beta^*_i / \|\beta^*_i\|_2 \in \mathbb{R}^K$.

Figure 4. Plots of the Lasso sample complexity $\theta = n/[2 s \log(p - s)]$ for which the probability
of union support recovery exceeds 50% empirically as a function of $|\cos(\alpha)|$ for $\ell_1$-based
recovery and $\ell_1/\ell_2$-based recovery, where $\alpha$ is the angle between $Z^{(1)*}$ and $Z^{(2)*}$ for the
family $B_1$. We consider the two following methods for performing row selection: ordinary
Lasso (green triangles) and group $\ell_1/\ell_2$ Lasso (blue circles).



3.1 High-level proof outline 

At a high level, the proof is based on the notion of a primal-dual witness: we construct a
primal matrix $\hat B$ along with a dual matrix $\hat Z$ such that:

(a) the pair $(\hat B, \hat Z)$ together satisfy the Karush-Kuhn-Tucker (KKT) conditions associated
with the block-regularized second-order cone program (SOCP), and

(b) this solution certifies that the SOCP recovers the union of supports $S$.

For general high-dimensional problems (with $p \gg n$), the SOCP need not have a unique
solution; however, a consequence of our theory is that the constructed solution $\hat B$ is the
unique optimal solution under the conditions of Theorem 1.

We begin by noting that the block-regularized problem is convex, but not differentiable
for all $B$. In particular, denoting by $\beta_i$ the $i$th row of $B$, the subdifferential of the
$\ell_1/\ell_2$-block norm over row $i$ takes the form

$$[\partial \|B\|_{\ell_1/\ell_2}]_i = \begin{cases} \dfrac{\beta_i}{\|\beta_i\|_2} & \text{if } \beta_i \neq 0, \\ Z_i \text{ such that } \|Z_i\|_2 \le 1 & \text{otherwise.} \end{cases} \qquad (26)$$
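
The row-wise structure of the subdifferential is easy to check numerically. The following sketch assumes NumPy; `block_norm` and `block_subgradient` are illustrative names, and choosing the zero vector on rows where $\beta_i = 0$ is just one admissible element of the unit ball. Any subgradient $Z$ must satisfy $\|B'\|_{\ell_1/\ell_2} \ge \|B\|_{\ell_1/\ell_2} + \langle Z, B' - B \rangle$ for every $B'$.

```python
import numpy as np

def block_norm(B):
    """The l1/l2 block norm: sum of the Euclidean norms of the rows."""
    return float(np.linalg.norm(B, axis=1).sum())

def block_subgradient(B):
    """One element of the subdifferential of the block norm at B.

    Nonzero rows are normalized to unit length; zero rows receive the zero
    vector, which is one valid choice with Euclidean norm at most 1.
    """
    B = np.asarray(B, dtype=float)
    norms = np.linalg.norm(B, axis=1)
    Z = np.zeros_like(B)
    nonzero = norms > 0
    Z[nonzero] = B[nonzero] / norms[nonzero, None]
    return Z
```

A quick use is to draw random matrices `B_prime` and confirm that `block_norm(B_prime) - block_norm(B) - np.sum(Z * (B_prime - B))` is never negative.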

We also use the shorthand $\zeta(\beta_i) = \beta_i / \|\beta_i\|_2$, with an analogous definition for the matrix
$\zeta(B_S)$, assuming that no row of $B_S$ is identically zero. In addition, we define the empirical
covariance matrix

$$\hat\Sigma := \frac{1}{n} X^T X = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T, \qquad (27)$$

where $X_i \in \mathbb{R}^p$ denotes the $i$th row of $X$, viewed as a column vector. We also make use of the
shorthand $\hat\Sigma_{SS} = \frac{1}{n} X_S^T X_S$ and $\hat\Sigma_{S^c S} = \frac{1}{n} X_{S^c}^T X_S$, as well as
$\Pi_S = \frac{1}{n} X_S (\hat\Sigma_{SS})^{-1} X_S^T = X_S (X_S^T X_S)^{-1} X_S^T$ to denote the projector on the range of
$X_S$.

At the core of our constructive procedure is the following convex-analytic result, which
characterizes an optimal primal-dual pair for which the primal solution $\hat B$ correctly recovers
the support set $S$:

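Lemma 2. Suppose that there exists a primal-dual pair $(\hat B, \hat Z)$ that satisfies the conditions:

$$\hat Z_S = \zeta(\hat B_S) \qquad (28a)$$
$$\hat\Sigma_{SS}(\hat B_S - B^*_S) - \frac{1}{n} X_S^T W = -\lambda_n \hat Z_S \qquad (28b)$$
$$\big\| \lambda_n \hat Z_{S^c} \big\|_{\ell_\infty/\ell_2} := \Big\| \hat\Sigma_{S^c S}(\hat B_S - B^*_S) - \frac{1}{n} X_{S^c}^T W \Big\|_{\ell_\infty/\ell_2} < \lambda_n \qquad (28c)$$
$$\hat B_{S^c} = 0. \qquad (28d)$$

Then $(\hat B, \hat Z)$ is a primal-dual optimal solution to the block-regularized problem, with $\hat S(\hat B) = S$
by construction. If $\hat\Sigma_{SS} \succ 0$, then $\hat B$ is the unique optimal primal solution.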

See Appendix A for the proof of this claim. Based on Lemma 2, we proceed to construct
the required primal-dual pair $(\hat B, \hat Z)$ as follows. First, we set $\hat B_{S^c} = 0$, so that condition (28d)
is satisfied. Next, we specify the pair $(\hat B_S, \hat Z_S)$ by solving the following restricted version of
the SOCP:

$$\hat B_S = \arg\min_{B_S \in \mathbb{R}^{s \times K}} \Big\{ \frac{1}{2n} \Big|\Big|\Big| Y - X \begin{bmatrix} B_S \\ 0_{S^c} \end{bmatrix} \Big|\Big|\Big|_F^2 + \lambda_n \|B_S\|_{\ell_1/\ell_2} \Big\}. \qquad (29)$$

Since $s < n$, the empirical covariance (sub)matrix $\hat\Sigma_{SS} = \frac{1}{n} X_S^T X_S$ is strictly positive definite
with probability one, which implies that the restricted problem (29) is strictly convex and
therefore has a unique optimum $\hat B_S$. We then choose $\hat Z_S$ to be the solution of equation (28b).
Since any such matrix $\hat Z_S$ is also a dual solution to the SOCP (29), it must be an element
of the subdifferential $\partial \|\hat B_S\|_{\ell_1/\ell_2}$.



It remains to show that this construction satisfies conditions (28a) and (28c). In order
to satisfy condition (28a), it suffices to show that no row of the solution $\hat B_S$ is identically
zero. From equation (28b) and using the invertibility of the empirical covariance matrix
$\hat\Sigma_{SS}$, we may solve as follows

$$(\hat B_S - B^*_S) = (\hat\Sigma_{SS})^{-1} \Big[ \frac{X_S^T W}{n} - \lambda_n \hat Z_S \Big] =: U_S. \qquad (30)$$

Note that for any row $i \in S$, by the triangle inequality, we have

$$\|\hat\beta_i\|_2 \ge \|\beta^*_i\|_2 - \|U_S\|_{\ell_\infty/\ell_2}.$$

Therefore, in order to show that no row of $\hat B_S$ is identically zero, it suffices to show that
the event

$$\mathcal{E}(U_S) := \Big\{ \|U_S\|_{\ell_\infty/\ell_2} \le \frac{1}{2} b^*_{\min} \Big\} \qquad (31)$$

occurs with high probability. (Recall from equation (13) that the parameter $b^*_{\min}$ measures
the minimum $\ell_2$-norm of any row of $B^*_S$.) We establish this result in Section 3.2.

Turning to condition (28c), by substituting expression (30) for the difference $(\hat B_S - B^*_S)$
into equation (28c), we obtain a $(p - s) \times K$ random matrix $V_{S^c}$, with rows indexed by $S^c$.
For any index $j \in S^c$, the corresponding row vector $V_j \in \mathbb{R}^K$ is given by

$$V_j := X_j^T \Big( [\Pi_S - I_n] \frac{W}{n} - \lambda_n \frac{X_S}{n} (\hat\Sigma_{SS})^{-1} \hat Z_S \Big). \qquad (32)$$

In order for condition (28c) to hold, it is necessary and sufficient that the probability of the
event

$$\mathcal{E}(V_{S^c}) := \Big\{ \|V_{S^c}\|_{\ell_\infty/\ell_2} < \lambda_n \Big\} \qquad (33)$$

converges to one as $n$ tends to infinity. Consequently, the remainder (and bulk) of the proof
is devoted to showing that the probabilities $P[\mathcal{E}(U_S)]$ and $P[\mathcal{E}(V_{S^c})]$ both converge to one
under the specified conditions.
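
The construction above can be sketched in code. This is a minimal illustration assuming NumPy, a $\frac{1}{2n}$-scaled squared Frobenius loss, and a plain proximal-gradient solver for the restricted problem; the solver and the function names (`block_soft_threshold`, `solve_restricted`, `primal_dual_witness`) are assumptions of this sketch, not the method used in the paper. With residual $R = Y - X_S \hat B_S$, equation (28b) gives $\hat Z_S = X_S^T R / (n \lambda_n)$, and the matrix with rows $V_j$ equals $-X_{S^c}^T R / n$.

```python
import numpy as np

def block_soft_threshold(B, tau):
    """Row-wise shrinkage: the proximal operator of tau times the l1/l2 block norm."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-300))
    return scale * B

def solve_restricted(X_S, Y, lam, n_iter=2000):
    """Proximal gradient for min (1/2n)||Y - X_S B||_F^2 + lam * ||B||_{l1/l2}."""
    n = X_S.shape[0]
    step = n / np.linalg.norm(X_S, 2) ** 2  # 1/L with L = ||X_S||_2^2 / n
    B = np.zeros((X_S.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = -X_S.T @ (Y - X_S @ B) / n
        B = block_soft_threshold(B - step * grad, step * lam)
    return B

def primal_dual_witness(X, Y, S, lam):
    """Return (B_S_hat, Z_S_hat, V) for the witness construction on support S."""
    n, p = X.shape
    S = np.asarray(S)
    S_c = np.setdiff1d(np.arange(p), S)
    X_S = X[:, S]
    B_S = solve_restricted(X_S, Y, lam)
    R = Y - X_S @ B_S                 # residual of the restricted fit
    Z_S = X_S.T @ R / (n * lam)       # dual block solving equation (28b)
    V = -X[:, S_c].T @ R / n          # rows V_j, j in S^c
    return B_S, Z_S, V
```

Given a known $B^*$, the two events of interest can then be checked with `np.linalg.norm(B_S - B_star[S], axis=1).max()` against half the minimum row norm, and `np.linalg.norm(V, axis=1).max()` against `lam`.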

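3.2 Analysis of $\mathcal{E}(U_S)$: Correct inclusion of supporting covariates

This section is devoted to the analysis of the event $\mathcal{E}(U_S)$ from equation (31), and in
particular showing that its probability converges to one under the specified scaling. We
begin by defining

$$\tilde W := \frac{1}{\sqrt{n}} (\hat\Sigma_{SS})^{-1/2} X_S^T W.$$

With this notation, we have

$$U_S = (\hat\Sigma_{SS})^{-1/2} \frac{\tilde W}{\sqrt{n}} - \lambda_n (\hat\Sigma_{SS})^{-1} \hat Z_S.$$

Using this representation and the triangle inequality, we have

$$\|U_S\|_{\ell_\infty/\ell_2} \le \Big\| (\hat\Sigma_{SS})^{-1/2} \frac{\tilde W}{\sqrt{n}} \Big\|_{\ell_\infty/\ell_2} + \lambda_n \big\| (\hat\Sigma_{SS})^{-1} \hat Z_S \big\|_{\ell_\infty/\ell_2}$$
$$\le \underbrace{\Big\| (\hat\Sigma_{SS})^{-1/2} \frac{\tilde W}{\sqrt{n}} \Big\|_{\ell_\infty/\ell_2}}_{T_1} + \underbrace{\lambda_n |||(\hat\Sigma_{SS})^{-1}|||_\infty}_{T_2},$$

where the form of $T_2$ in the second line uses a standard matrix norm bound (see equation
(42a) in Appendix B), and the fact that $\|\hat Z_S\|_{\ell_\infty/\ell_2} \le 1$.

Using the triangle inequality, we bound $T_2$ as follows:

$$T_2 \le \lambda_n \Big\{ |||(\Sigma_{SS})^{-1}|||_\infty + |||(\hat\Sigma_{SS})^{-1} - (\Sigma_{SS})^{-1}|||_\infty \Big\}$$
$$\le \lambda_n \Big\{ D_{\max} + \sqrt{s}\, |||(\Sigma_{SS})^{-1}|||_2 \, |||(\tilde X_S^T \tilde X_S / n)^{-1} - I_s|||_2 \Big\}$$
$$\le \lambda_n \Big\{ D_{\max} + \frac{\sqrt{s}}{C_{\min}} |||(\tilde X_S^T \tilde X_S / n)^{-1} - I_s|||_2 \Big\},$$

which defines $\tilde X_S$ as a random matrix with i.i.d. standard Gaussian entries. From concentration
results in random matrix theory (see Appendix C), for $s/n \to 0$, we have

$$|||(\tilde X_S^T \tilde X_S / n)^{-1} - I_s|||_2 \le O\Big(\sqrt{\frac{s}{n}}\Big)$$

with probability $1 - \exp(-\Theta(n))$. Overall, we conclude that

$$T_2 \le \lambda_n \Big\{ D_{\max} + O\Big(\sqrt{\frac{s^2}{n}}\Big) \Big\}$$

with probability $1 - \exp(-\Theta(n))$.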

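Turning now to $T_1$, note that conditioned on $X_S$, we have $(\operatorname{vec}(\tilde W) \mid X_S) \sim N(0_{s \times K}, I_s \otimes I_K)$,
where $\operatorname{vec}(A)$ denotes the vectorization of matrix $A$. Using this fact and the definition
of the block $\ell_\infty/\ell_2$ norm,

$$T_1 = \max_{i \in S} \Big\| e_i^T (\hat\Sigma_{SS})^{-1/2} \frac{\tilde W}{\sqrt{n}} \Big\|_2 \le |||(\hat\Sigma_{SS})^{-1}|||_2^{1/2} \Big[ \frac{1}{n} \max_{i \in S} \zeta_i^2 \Big]^{1/2},$$

which defines $\zeta_i^2$ as independent $\chi^2$-variates with $K$ degrees of freedom. Using the tail
bound in Lemma 8 (see Appendix F) with $t = 2K \log s > K$, we have

$$P\Big[ \frac{1}{n} \max_{i \in S} \zeta_i^2 \ge \frac{4K \log s}{n} \Big] \le s \exp\Big( -2K \log s \Big[ 1 - 2\sqrt{\frac{1}{2 \log s}} \Big] \Big) \to 0.$$

Defining the event $\mathcal{T} := \big\{ |||(\hat\Sigma_{SS})^{-1}|||_2 \le \frac{2}{C_{\min}} \big\}$, we have $P[\mathcal{T}] \ge 1 - 2\exp(-\Theta(n))$, again
using concentration results from random matrix theory (see Appendix C). Therefore

$$P\Big[ T_1 \ge \sqrt{\frac{8K \log s}{C_{\min} n}} \Big] \le P\Big[ T_1 \ge \sqrt{\frac{8K \log s}{C_{\min} n}} \,\Big|\, \mathcal{T} \Big] + P[\mathcal{T}^c]$$
$$\le P\Big[ \frac{1}{n} \max_{i \in S} \zeta_i^2 \ge \frac{4K \log s}{n} \Big] + 2\exp(-\Theta(n))$$
$$= O(\exp(-\Theta(\log s))) \to 0.$$

Finally, combining the pieces, we conclude that with probability $1 - \exp(-\Theta(\log s))$, we
have

$$\frac{\|U_S\|_{\ell_\infty/\ell_2}}{b^*_{\min}} \le \frac{1}{b^*_{\min}} [T_1 + T_2] \le \frac{1}{b^*_{\min}} \Big[ O\Big(\sqrt{\frac{\log s}{n}}\Big) + \lambda_n \Big( D_{\max} + O\Big(\frac{s}{\sqrt{n}}\Big) \Big) \Big].$$

With the assumed scaling $n = \Omega(s \log(p - s))$, we have

$$\frac{\|U_S\|_{\ell_\infty/\ell_2}}{b^*_{\min}} \le \frac{1}{b^*_{\min}} \Big[ O\Big(\frac{1}{\sqrt{s}}\Big) + \lambda_n \Big( D_{\max} + O\Big(\sqrt{\frac{s}{\log(p - s)}}\Big) \Big) \Big] \qquad (34)$$

with probability greater than $1 - 2\exp(-c \log(s)) \to 1$, so that the conditions of Theorem 1
are sufficient to ensure that event $\mathcal{E}(U_S)$ holds with high probability as claimed.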

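3.3 Analysis of $\mathcal{E}(V_{S^c})$: Correct exclusion of non-support

For simplicity, in the following arguments, we drop the index $S^c$ and write $V$ for $V_{S^c}$. In
order to show that $\|V\|_{\ell_\infty/\ell_2} < \lambda_n$ with probability converging to one, we make use of the
decomposition $\frac{1}{\lambda_n} \|V\|_{\ell_\infty/\ell_2} \le \sum_{i=1}^{3} T'_i$, where

$$T'_1 := \frac{1}{\lambda_n} \big\| E[V \mid X_S] \big\|_{\ell_\infty/\ell_2} \qquad (35a)$$
$$T'_2 := \frac{1}{\lambda_n} \big\| E[V \mid X_S, W] - E[V \mid X_S] \big\|_{\ell_\infty/\ell_2} \qquad (35b)$$
$$T'_3 := \frac{1}{\lambda_n} \big\| V - E[V \mid X_S, W] \big\|_{\ell_\infty/\ell_2}. \qquad (35c)$$

We deal with each of these three terms in turn, showing that with high probability under
the specified scaling of $(n, p, s)$, we have $T'_1 \le (1 - \gamma)$, $T'_2 = o_p(1)$, and $T'_3 < \gamma$, which
suffices to show that $\frac{1}{\lambda_n} \|V\|_{\ell_\infty/\ell_2} < 1$ with high probability.
The following lemma is useful in the analysis:

Lemma 3. Define the matrix $\Delta \in \mathbb{R}^{s \times K}$ with rows $\Delta_i := U_i / \|\beta^*_i\|_2$. As long as $\|\Delta_i\|_2 \le 1/2$
for all row indices $i \in S$, we have

$$\|\hat Z_S - \zeta(B^*_S)\|_{\ell_\infty/\ell_2} \le 4 \|\Delta\|_{\ell_\infty/\ell_2}.$$

See Appendix D for the proof of this claim.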



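3.3.1 Analysis of $T'_1$

Note that by definition of the regression model, we have the conditional independence
relations

$$W \perp\!\!\!\perp X_{S^c} \mid X_S, \qquad \hat Z_S \perp\!\!\!\perp X_{S^c} \mid X_S, \qquad \text{and} \qquad \hat Z_S \perp\!\!\!\perp X_{S^c} \mid \{X_S, W\}.$$

Using the first two conditional independencies, we have

$$E[V \mid X_S] = E[X_{S^c}^T \mid X_S] \Big( [\Pi_S - I_n] \frac{E[W \mid X_S]}{n} - \lambda_n \frac{X_S}{n} (\hat\Sigma_{SS})^{-1} E[\hat Z_S \mid X_S] \Big).$$

Since $E[W \mid X_S] = 0$, the first term vanishes, and using $E[X_{S^c}^T \mid X_S] = \Sigma_{S^c S} (\Sigma_{SS})^{-1} X_S^T$, we
obtain

$$E[V \mid X_S] = -\lambda_n \Sigma_{S^c S} (\Sigma_{SS})^{-1} E[\hat Z_S \mid X_S]. \qquad (36)$$

Using the matrix-norm inequality (42a) of Appendix B and then Jensen's inequality yields

$$T'_1 = \big\| \Sigma_{S^c S} (\Sigma_{SS})^{-1} E[\hat Z_S \mid X_S] \big\|_{\ell_\infty/\ell_2}$$
$$\le |||\Sigma_{S^c S} (\Sigma_{SS})^{-1}|||_\infty \, E\big[ \|\hat Z_S\|_{\ell_\infty/\ell_2} \mid X_S \big]$$
$$\le (1 - \gamma).$$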

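3.3.2 Analysis of $T'_2$

Appealing to the conditional independence relationship $\hat Z_S \perp\!\!\!\perp X_{S^c} \mid \{X_S, W\}$, we have

$$E[V \mid X_S, W] = E[X_{S^c}^T \mid X_S, W] \Big( [\Pi_S - I_n] \frac{W}{n} - \lambda_n \frac{X_S}{n} (\hat\Sigma_{SS})^{-1} E[\hat Z_S \mid X_S, W] \Big).$$

Observe that $E[\hat Z_S \mid X_S, W] = \hat Z_S$ because $(X_S, W)$ uniquely specifies $\hat B_S$ through the convex
program (29), and the triple $(X_S, W, \hat B_S)$ defines $\hat Z_S$ through equation (28b). Moreover, the
noise term disappears because the kernel of the orthogonal projection matrix $(I_n - \Pi_S)$ is
the same as the range space of $X_S$, and

$$E[X_{S^c}^T \mid X_S, W] [\Pi_S - I_n] = E[X_{S^c}^T \mid X_S] [\Pi_S - I_n] = \Sigma_{S^c S} (\Sigma_{SS})^{-1} X_S^T [\Pi_S - I_n] = 0.$$

We have thus shown that $E[V \mid X_S, W] = -\lambda_n \Sigma_{S^c S} (\Sigma_{SS})^{-1} \hat Z_S$, so that we can conclude that

$$T'_2 \le (1 - \gamma)\, E\big[ \|\hat Z_S - Z^*_S\|_{\ell_\infty/\ell_2} \mid X_S \big] + (1 - \gamma)\, \|\hat Z_S - Z^*_S\|_{\ell_\infty/\ell_2}$$
$$\le (1 - \gamma)\, 4 \big\{ E\big[ \|\Delta\|_{\ell_\infty/\ell_2} \mid X_S \big] + \|\Delta\|_{\ell_\infty/\ell_2} \big\},$$

where the final inequality uses Lemma 3. Under the assumptions of Theorem 1, this final
term is of order $o_p(1)$, as shown in Section 3.2.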

3.3.3 Analysis of $T'_3$

This third term requires a little more care. We begin by noting that conditionally on $X_S$ and
$W$, each vector $V_j \in \mathbb{R}^K$ is normally distributed. Since $\operatorname{Cov}(X^{(j)} \mid X_S, W) = (\Sigma_{S^c \mid S})_{jj} I_n$,
we have

$$\operatorname{Cov}(V_j \mid X_S, W) = M_n (\Sigma_{S^c \mid S})_{jj},$$

where the $K \times K$ random matrix $M_n = M_n(X_S, W)$ is given by

$$M_n := \frac{\lambda_n^2}{n} \hat Z_S^T (\hat\Sigma_{SS})^{-1} \hat Z_S + \frac{1}{n^2} W^T (I_n - \Pi_S) W. \qquad (37)$$

Conditionally on $W$ and $X_S$, the matrix $M_n$ is fixed, and we have

$$\big( \|V_j - E[V_j \mid X_S, W]\|_2^2 \mid W, X_S \big) \overset{d}{=} (\Sigma_{S^c \mid S})_{jj} \, \xi_j^T M_n \xi_j,$$

where $\xi_j \sim N(0_K, I_K)$. Since $(\Sigma_{S^c \mid S})_{jj} \le (\Sigma_{S^c S^c})_{jj} \le C_{\max}$ for all $j$, we have

$$\max_{j \in S^c} (\Sigma_{S^c \mid S})_{jj} \, \xi_j^T M_n \xi_j \le C_{\max} |||M_n|||_2 \max_{j \in S^c} \|\xi_j\|_2^2,$$

where $|||M_n|||_2$ is the spectral norm.

We now state a result that provides control on this spectral norm. Intuitively, this
result is based on the fact that the matrix $\frac{n}{\lambda_n^2} M_n$ is a random matrix that concentrates
in spectral norm around the matrix $M^* = (Z^*_S)^T (\Sigma_{SS})^{-1} Z^*_S$, where $Z^*_S = \zeta(B^*_S)$, and the
fact that the spectral norm of $M^*$ is directly proportional to the defined sparsity/overlap
function $\psi(B^*) := |||\zeta(B^*_S)^T (\Sigma_{SS})^{-1} \zeta(B^*_S)|||_2$.

Lemma 4. For any $\delta > 0$, define the event

$$\mathcal{T}(\delta) := \Big\{ |||M_n|||_2 \le \lambda_n^2 \frac{\psi(B^*)}{n} (1 + \delta) \Big\}. \qquad (38)$$

Under the conditions of Theorem 1, for any $\delta > 0$, there is some $c_1 > 0$ such that
$P[\mathcal{T}(\delta)^c] \le 2\exp(-c_1 \log s) \to 0$.

See Appendix E for the proof of this lemma.

Using Lemma 4, we can now complete the proof. For any fixed $\delta > 0$ (which can be
made arbitrarily small), we have

$$P[T'_3 \ge \gamma] \le P[T'_3 \ge \gamma \mid \mathcal{T}(\delta)] + P[\mathcal{T}(\delta)^c].$$

Since $P[\mathcal{T}(\delta)^c] \to 0$ from Lemma 4, it suffices to deal with the first term. Conditioning on
the event $\mathcal{T}(\delta)$, we have

$$P[T'_3 \ge \gamma \mid \mathcal{T}(\delta)] \le P\Big[ \max_{j \in S^c} \|\xi_j\|_2^2 \ge \frac{\gamma^2}{C_{\max}} \frac{n}{\psi(B^*) (1 + \delta)} \Big].$$

Define the quantity $t^*(n, B^*) := \frac{1}{2} \frac{\gamma^2}{C_{\max}} \frac{n}{\psi(B^*) (1 + \delta)}$, and note that $t^* \to +\infty$ under the
specified scaling of $(n, p, s)$. By applying Lemma 8 from Appendix F on large deviations
for $\chi^2$-variates with $t = t^*(n, B^*)$, we obtain

$$P[T'_3 \ge \gamma \mid \mathcal{T}(\delta)] \le (p - s) \exp\Big( -t^* \Big[ 1 - 2\sqrt{\frac{K}{t^*}} \Big] \Big) \le (p - s) \exp\big( -t^* (1 - \delta) \big) \qquad (39)$$

for $(n, p, s)$ sufficiently large. Thus, the bound (39) tends to zero at rate $O(\exp(-c \log(p - s)))$
as long as there exists $\nu > 0$ such that we have $(1 - \delta)\, t^*(n, B^*) > (1 + \nu) \log(p - s)$,
or equivalently

$$n > (1 + \nu) \frac{(1 + \delta)}{(1 - \delta)} \frac{C_{\max}}{\gamma^2} \big[ 2 \psi(B^*) \log(p - s) \big],$$

as claimed.



4 Discussion 

In this paper, we have analyzed the high-dimensional behavior of block-regularization for
multivariate regression problems, and shown that its behavior is governed by the sample
complexity parameter

$$\theta_{\ell_1/\ell_2}(n, p, s) := n / [2 \psi(B^*) \log(p - s)],$$

where $n$ is the sample size, $p$ is the ambient dimension, and $\psi(B^*)$ is a sparsity-overlap function
that measures a combination of the sparsity and overlap properties of the true regression
matrix $B^*$.

There are a number of open questions associated with this work. First, note that the
current paper provides only an achievability condition (i.e., support recovery can be achieved
once the control parameter is larger than some finite critical threshold $t^*$). However, based
both on empirical results (see Figures 2 and 3) and technical aspects of the proof, we
conjecture that our characterization is in fact sharp, meaning that the block-regularized
convex program fails to recover the support once the control parameter $\theta_{\ell_1/\ell_2}$ drops
below some critical threshold. Indeed, this conjecture is consistent with the special case of
univariate regression with $K = 1$, where it is known (Wainwright 2006) that the Lasso fails
once the ratio $n/[2 s \log(p - s)]$ falls below a critical threshold. Secondly, the current work
applies to the "hard"-sparsity model, in which a subset $S$ of the regressors are non-zero, and
the remaining coefficients are zero. As with the ordinary Lasso, it would also be interesting
to study block-regularization under soft sparsity models (e.g., $\ell_q$ "balls" for coefficients,
with $q < 1$), under an alternative loss function such as mean-squared error, as opposed to
the exact support recovery criterion considered here.

Acknowledgements 

This research was partially supported by NSF grants DMS-0605165 and CCF-0545862 to 
MJW and by NSF Grant 0509559 and DARPA IPTO Contract FA8750-05-2-0249 to MIJ. 



A Proof of Lemma 2

Using the notation $\beta_i$ to denote a row of $B$ and denoting by

$$\mathcal{K} := \{ (w, v) \in \mathbb{R}^K \times \mathbb{R} \mid \|w\|_2 \le v \} \qquad (40)$$

the usual second-order cone (SOC), we can rewrite the original convex program as

$$\min_{B \in \mathbb{R}^{p \times K},\; b \in \mathbb{R}^p} \; \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \sum_{i=1}^{p} b_i \qquad (41)$$
$$\text{s.t.} \quad (\beta_i, b_i) \in \mathcal{K}, \quad 1 \le i \le p.$$

We now dualize the conic constraints (Boyd and Vandenberghe 2004), using conic Lagrange
multipliers belonging to the dual cone $\mathcal{K}^* = \{ (z, t) \in \mathbb{R}^{K+1} \mid z^T w + v t \ge 0 \text{ for all } (w, v) \in \mathcal{K} \}$.
The second-order cone $\mathcal{K}$ is self-dual (Boyd and Vandenberghe 2004), so that the convex
program (41) is equivalent to

$$\min_{B \in \mathbb{R}^{p \times K},\; b \in \mathbb{R}^p} \; \max_{Z \in \mathbb{R}^{p \times K},\; t \in \mathbb{R}^p} \; \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \sum_{i=1}^{p} b_i - \lambda_n \sum_{i=1}^{p} \big( -z_i^T \beta_i + t_i b_i \big)$$
$$\text{s.t.} \quad (z_i, t_i) \in \mathcal{K}, \quad 1 \le i \le p,$$

where $Z$ is the matrix whose $i$th row is $z_i$.

Since the original program is convex and strictly feasible, strong duality holds and any
pair of primal $(B^*, b^*)$ and dual solutions $(Z^*, t^*)$ has to satisfy the Karush-Kuhn-Tucker
conditions:

$$\|\beta^*_i\|_2 \le b^*_i, \quad 1 \le i \le p \qquad (41a)$$
$$\|z^*_i\|_2 \le t^*_i, \quad 1 \le i \le p \qquad (41b)$$
$$z_i^{*T} \beta^*_i - t^*_i b^*_i = 0, \quad 1 \le i \le p \qquad (41c)$$
$$\nabla_B \Big[ \frac{1}{2n} |||Y - XB|||_F^2 \Big]_{B = B^*} + \lambda_n Z^* = 0 \qquad (41d)$$
$$\lambda_n (1 - t^*_i) = 0. \qquad (41e)$$

Since equations (41c) and (41e) impose the constraints $t^*_i = 1$ and $b^*_i = \|\beta^*_i\|_2$, a primal-dual
solution to this conic program is determined by $(B^*, Z^*)$.

Any solution satisfying the conditions in Lemma 2 also satisfies these KKT conditions,
since equation (28b) and the definition (28c) are equivalent to equation (41d), and
equation (28a) and the combination of conditions (28d) and (28c) imply that the complementary
slackness equations (41c) hold for each primal-dual conic pair $(\beta_i, z_i)$.

Now consider some other primal solution $\tilde B$; when combined with the optimal dual
solution $\hat Z$, the pair $(\tilde B, \hat Z)$ must satisfy the KKT conditions (Bertsekas, 1995). But since
for $j \in S^c$, we have $\|\hat z_j\|_2 < 1$, then the complementary slackness condition (41c) implies
that for all $j \in S^c$, $\tilde\beta_j = 0$. This fact in turn implies that the primal solution $\tilde B$ must also be
a solution to the restricted convex program (29), obtained by only considering the covariates
in the set $S$ or equivalently by setting $B_{S^c} = 0_{S^c}$. But since $s < n$ by assumption, the matrix
$X_S^T X_S$ is strictly positive definite with probability one, and therefore the restricted convex
program (29) has a unique solution $\tilde B_S = \hat B_S$. We have thus shown that a solution $(\hat B, \hat Z)$
to the program that satisfies the conditions of Lemma 2, if it exists, must be unique.



B Inequalities with block-matrix norms 

In general, the two families of matrix norms that we have introduced, ||| 
are distinct, but they coincide in the following useful special case: 

Lemma 5. For 1 < p < oo and for r defined byl/r + l/p = l we have 



and 



Proof. Indeed, if Oj denotes the i^^ row of A, then 



l^ll 



/« = max a,- L = max max w- a,- = max max u a,- 

= i " i Mr<l' \\y\\r<l I ' 



,max Pylloo 
l!y||r<i 



II ^11 



We conclude by stating some useful bounds and relations: 

Lemma 6. Consider matrices A G l^™x" and Z G M"^^ and p,r > with ^ + j. 
have: 



\AZ\ 



\\AZ\\ 



oo, oo oo, r 



< III/, 



mlllr, oo m4 oo, r 



||Z|| 
s^'' \\A\ 



,oo . 



■^oo / ^p 



□ 

1, we 

(42a) 
(42b) 



C Some concentration inequalities for random matrices 

In this appendix, we state some known concentration inequalities for the extreme eigenvalues
of Gaussian random matrices (Davidson and Szarek 2001). Although these results hold
more generally, our interest here is on scalings $(n, s)$ such that $s/n \to 0$.

Lemma 7. Let $U \in \mathbb{R}^{n \times s}$ be a random matrix from the standard Gaussian ensemble (i.e.,
$U_{ij} \sim N(0, 1)$, i.i.d.). Then



1 



n 



-U^U-Isxs 



> 



< 2exp(— cn) — > 0. 



(43) 



This result is adapted easily to more general Gaussian ensembles. Letting $X = U \sqrt{\Lambda}$,
we obtain an $n \times s$ matrix with i.i.d. rows, $X_i \sim N(0, \Lambda)$. If the covariance matrix $\Lambda$ has
maximum eigenvalue $C_{\max} < \infty$, then we have

$$|||n^{-1} X^T X - \Lambda|||_2 \le C_{\max} |||n^{-1} U^T U - I|||_2 \qquad (44)$$

so that the bound (43) immediately yields an analogous bound with different constants.

The final type of bound that we require is on the difference $|||(X^T X / n)^{-1} - \Lambda^{-1}|||_2$,
assuming that $X^T X$ is invertible. We note that

$$|||(X^T X / n)^{-1} - \Lambda^{-1}|||_2 = |||(X^T X / n)^{-1} [\Lambda - (X^T X / n)] \Lambda^{-1}|||_2$$
$$\le |||(X^T X / n)^{-1}|||_2 \, |||\Lambda - (X^T X / n)|||_2 \, |||\Lambda^{-1}|||_2.$$

As long as the eigenvalues of $\Lambda$ are bounded below by $C_{\min} > 0$, then $|||\Lambda^{-1}|||_2 \le 1/C_{\min}$.
Moreover, since $s/n \to 0$, we have (from equation (44)) that $|||(X^T X / n)^{-1}|||_2 \le 2/C_{\min}$
with probability converging to one exponentially in $n$. Thus, equation (44) implies the
desired bound.
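
Inequality (44) and the three-factor bound on the inverse difference are deterministic, so they can be checked directly on simulated data. The sketch below assumes NumPy; the function names and the way the covariance matrix is generated (a random orthogonal basis with prescribed eigenvalues) are choices made for this illustration only.

```python
import numpy as np

def spectral_norm(A):
    """Largest singular value of A."""
    return float(np.linalg.norm(A, 2))

def covariance_deviation_terms(n, s, eigenvalues, seed=0):
    """Return (lhs44, rhs44, lhs_inv, rhs_inv) for X = U sqrt(Lambda).

    lhs44 / rhs44   : the two sides of inequality (44)
    lhs_inv / rhs_inv : the two sides of the bound on the inverse difference
    """
    rng = np.random.default_rng(seed)
    eig = np.asarray(eigenvalues, dtype=float)
    Q, _ = np.linalg.qr(rng.standard_normal((s, s)))  # random orthogonal basis
    Lam = Q @ np.diag(eig) @ Q.T
    sqrt_Lam = Q @ np.diag(np.sqrt(eig)) @ Q.T        # symmetric square root
    inv_Lam = Q @ np.diag(1.0 / eig) @ Q.T

    U = rng.standard_normal((n, s))
    X = U @ sqrt_Lam
    emp = X.T @ X / n                                  # empirical covariance

    lhs44 = spectral_norm(emp - Lam)
    rhs44 = eig.max() * spectral_norm(U.T @ U / n - np.eye(s))

    inv_emp = np.linalg.inv(emp)
    lhs_inv = spectral_norm(inv_emp - inv_Lam)
    rhs_inv = spectral_norm(inv_emp) * lhs44 * spectral_norm(inv_Lam)
    return lhs44, rhs44, lhs_inv, rhs_inv
```

Increasing `n` with `s` fixed shows both deviations shrinking, in line with the regime $s/n \to 0$ considered in this appendix.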



D Proof of Lemma 3

From the previous section, the condition $\|\Delta_i\|_2 \le 1/2$ implies that $\hat\beta_i \neq 0$ and hence
$\hat Z_i = \hat\beta_i / \|\hat\beta_i\|_2$ for all rows $i \in S$. Therefore, using the notation $Z^*_i = \beta^*_i / \|\beta^*_i\|_2$ we have



Zi-Z* 



2 



Z*+Ai 

\Z*+A 



i\\2 



Z* 



1 



Z* +A 



i\\2 



1 + 



A,. 



\Z*+Ai\\2 



Note that, for z^O, g{z,5) 



is differentiable with respect to 5, with gradient 



Vs g{z, 6) = ~2j|2+fjp' mean-value theorem, there exists h £ [0, 1] such that 



1 



l=g{z,5)-g{z,0)=Vsg{z,h6f6 



\\z + 5\\2 

which implies that there exists hi £ [0, 1] such that 

117 7*11 < 117*11 \iZ*+h,A,fA,, 



'2||z + M||3^ 



lA 



i||2 



2||Z* + /iiA-ii3 



< 



|A 



j 2 



+ 



■iL^i\\2 \\Z* + Aj||2 

"Ailb 



2\\Z* + hiAi\\l ' \\Z*+A,\\2 



(45) 



We note that $\|Z^*_i\|_2 = 1$ and $\|\Delta_i\|_2 \le \frac{1}{2}$ imply that $\|Z^*_i + h_i \Delta_i\|_2 \ge \frac{1}{2}$. Combined with
inequality (45), we obtain $\|\hat Z_i - Z^*_i\|_2 \le 4 \|\Delta_i\|_2$, which proves the lemma.
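
The inequality of Lemma 3 can also be probed numerically. The sketch below assumes NumPy; `normalize_rows` and `lemma3_sides` are illustrative names. It perturbs each row of a matrix by at most half its norm and compares the change in the row-normalized matrix with four times the relative perturbation.

```python
import numpy as np

def normalize_rows(B):
    """Row normalization zeta: divide each row by its Euclidean norm."""
    return B / np.linalg.norm(B, axis=1, keepdims=True)

def lemma3_sides(B_star, U):
    """Return (lhs, rhs) of the Lemma 3 inequality in the l_inf/l_2 block norm.

    lhs = max_i || zeta(beta*_i + U_i) - zeta(beta*_i) ||_2
    rhs = 4 * max_i || U_i ||_2 / || beta*_i ||_2
    """
    row_norms = np.linalg.norm(B_star, axis=1, keepdims=True)
    delta = U / row_norms
    delta_norms = np.linalg.norm(delta, axis=1)
    if delta_norms.max() > 0.5:
        raise ValueError("Lemma 3 requires every relative perturbation to be at most 1/2")
    diff = normalize_rows(B_star + U) - normalize_rows(B_star)
    lhs = float(np.linalg.norm(diff, axis=1).max())
    rhs = float(4.0 * delta_norms.max())
    return lhs, rhs
```

In repeated random draws the left-hand side stays below the right-hand side, as the lemma asserts.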



E Proof of Lemma 4

With $Z^*_S = \zeta(B^*_S)$, define the $K \times K$ random matrix

$$M^*_n := \frac{\lambda_n^2}{n} (Z^*_S)^T (\hat\Sigma_{SS})^{-1} Z^*_S + \frac{1}{n^2} W^T (I_n - \Pi_S) W$$

and note that (using standard results on Wishart matrices (Anderson 1984))

$$E[M^*_n] = \frac{\lambda_n^2}{n - s - 1} (Z^*_S)^T (\Sigma_{SS})^{-1} Z^*_S + \sigma^2 \frac{n - s}{n^2} I_K. \qquad (46)$$

To bound $M_n$ in spectral norm, we use the triangle inequality:

$$|||M_n|||_2 \le \underbrace{|||M_n - M^*_n|||_2}_{A_1} + \underbrace{|||M^*_n - E[M^*_n]|||_2}_{A_2} + \underbrace{|||E[M^*_n]|||_2}_{A_3}. \qquad (47)$$

Considering the term $A_1$ in the decomposition (47), we have:

IK III 2 



^2 



/7* — 1 r?* /y v^ — 1 r7 



n 

~ n 



^s^sU^s - ^s) + {Zs - Zs)^gl{Zs + {Zs - Zg)) 



^-1 
^ss 



Z*s - Zs 



2 III -^S III 2 + 



Z*s - Zs 



(48) 



Using the concentration results on random matrices in Appendix jCl we have the bound 



^SS 



< 2/Cmin with probability greater than 1 — 2exp(— cn), and we have |||Z 



5 III 2 



0{y/s) by definition. Moreover, from equation (42b) in Lemma jsj we have 

we have — Zs 



Zl-Zs 



Z% - Zs 



. Using the bound l\34n and Lemma 



< 

2 

0(1) 



with probability greater than 1 — 2 exp(— clog s), so that from equation (48), we conclude 
that 



Ai = |||m: - M„|||2 = 0(^1 ^-^-P- 



(49) 



Turning to term A2, we have the upper bound A2 < Tj'^ + , where 



Tl 



A? 



n III III 2 
- Ill^5lll2 



n 



n — s — 1 



We have Tl = o I ) with probability greater than 1 — 2exp(— cn), since |||-^5'|||2 ^ s, and 



0(1) with high probability (see Appendix 



tJ, we have with probability greater than 1 — 2exp(— cn). 



n 



\ \\\w'^{In-Ils)W-a\n-s)lK\\\^ = o'^ 



n 



C ) . Turning to 



Ms 



n 



since A^s — > +00. Overall, we conclude that 

A2 = |||m:-e[m:]|||2 



An^ 

n 



w.h.p. 



Finally, turning to $A_3 = |||E[M^*_n]|||_2$, from equation (46), we have

$$|||E[M^*_n]|||_2 \le \frac{\lambda_n^2}{n} \frac{n}{n - s - 1} \psi(B^*) + \sigma^2 \frac{n - s}{n^2} \qquad (50)$$
$$\le (1 + o(1)) \frac{\lambda_n^2 \psi(B^*)}{n}. \qquad (51)$$

Combining bounds (49), (50), and (51) in the decomposition (47), and using the fact
that $\psi(B^*) = \Theta(s)$ (see Lemma 1(a)) yields that

$$|||M_n|||_2 \le (1 + o(1)) \frac{\lambda_n^2 \psi(B^*)}{n}$$

with probability greater than $1 - 2\exp(-c \log s)$, which establishes the claim.



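F Large deviations for $\chi^2$-variates

Lemma 8. Let $Z_1, \ldots, Z_m$ be i.i.d. $\chi^2$-variates with $d$ degrees of freedom. Then for all
$t > d$, we have

$$P\Big[ \max_{i = 1, \ldots, m} Z_i \ge 2t \Big] \le m \exp\Big( -t \Big[ 1 - 2\sqrt{\frac{d}{t}} \Big] \Big). \qquad (52)$$

Proof. Given a central $\chi^2$-variate $X$ with $d$ degrees of freedom, Laurent and Massart (1998)
prove that $P[X - d \ge 2\sqrt{dx} + 2x] \le \exp(-x)$, or equivalently

$$P\big[ X \ge x + (\sqrt{x} + \sqrt{d})^2 \big] \le \exp(-x),$$

valid for all $x > 0$. Setting $\sqrt{x} + \sqrt{d} = \sqrt{t}$, we have

$$P[X \ge 2t] \overset{(a)}{\le} P\big[ X \ge (\sqrt{t} - \sqrt{d})^2 + t \big] \le \exp\big( -(\sqrt{t} - \sqrt{d})^2 \big)$$
$$\le \exp(-t + 2\sqrt{td}) = \exp\Big( -t \Big[ 1 - 2\sqrt{\frac{d}{t}} \Big] \Big),$$

where inequality (a) follows since $\sqrt{t} \ge \sqrt{d}$ by assumption. Thus, the claim (52) follows by
the union bound. □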
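
For an even number of degrees of freedom the $\chi^2$ tail has a closed form, which allows an exact comparison with the bound (52). The sketch below uses only the Python standard library; the restriction to even $d$ and the function names (`chi2_sf_even`, `max_tail_exact`, `lemma8_bound`) are choices made for this illustration.

```python
import math

def chi2_sf_even(x, d):
    """Exact P[X >= x] for a chi-square variate with even d degrees of freedom.

    Uses the identity P[X >= x] = exp(-x/2) * sum_{j < d/2} (x/2)^j / j!.
    """
    if d <= 0 or d % 2 != 0:
        raise ValueError("d must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0  # j = 0 term
    for j in range(1, d // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def max_tail_exact(t, d, m):
    """Exact P[max_i Z_i >= 2t] for m independent chi-square(d) variates."""
    q = chi2_sf_even(2.0 * t, d)
    if q >= 1.0:
        return 1.0
    # 1 - (1 - q)^m, computed in a numerically stable way
    return -math.expm1(m * math.log1p(-q))

def lemma8_bound(t, d, m):
    """Right-hand side of (52): m * exp(-t * (1 - 2 * sqrt(d / t)))."""
    return m * math.exp(-t * (1.0 - 2.0 * math.sqrt(d / t)))
```

Comparing `max_tail_exact` with `lemma8_bound` over a grid of `t > d` gives a sense of how conservative the bound is for moderate `t`.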
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